1 Introduction

Curriculum learning (Bengio et al., 2009) refers to efficiently training effective neural networks by mimicking how humans learn, from easy to hard. As originally introduced by Bengio et al. (2009), curriculum learning is a training procedure that first organizes the examples in their increasing order of difficulty, then starts the training of the neural network on the easiest examples, gradually adding increasingly more difficult examples along the way, until all training examples are fed into the network. The success of the approach lies in avoiding the learning of very difficult examples right from the beginning, instead guiding the model on the right path through the imposed curriculum. This type of curriculum is later referred to as data-level curriculum learning (Soviany et al., 2022). Indeed, Soviany et al. (2022) identified several types of curriculum learning approaches in the literature, dividing them into four categories based on the components involved in the definition of machine learning given by Mitchell (1997). The four categories are: data-level curriculum (examples are presented from easy to hard), model-level curriculum (the modeling capacity of the network is gradually increased), task-level curriculum (the complexity of the learning task is increased during training), and objective-level curriculum (the model optimizes towards an increasingly more complex objective). While data-level curriculum is the most natural and direct way to employ curriculum learning, its main disadvantage is that it requires a way to determine the difficulty of data samples. Despite having many successful applications (Soviany et al., 2022; Wang et al., 2022), there is no universal way to determine the difficulty of the data samples, making the data-level curriculum less applicable to scenarios where the difficulty is hard to estimate, e.g. classification of radar signals. The task-level and objective-level curriculum learning strategies suffer from similar issues, e.g. it is hard to create a curriculum when the model has to learn an easy task (binary classification) or the objective function is already convex.

Fig. 1 Training based on Learning Rate Curriculum

Considering the above observations, we recognize the potential of model-level curriculum learning strategies to be applicable across a wider range of domains and tasks. To date, there are only a few works (Burduja, 2021; Karras et al., 2018; Sinha et al., 2020) in the category of pure model-level curriculum learning methods. However, these methods have some drawbacks caused by their domain-dependent or architecture-specific design. To benefit from the full potential of the model-level curriculum learning category, we propose LeRaC (Learning Rate Curriculum), a novel and simple curriculum learning approach that uses a different learning rate for each layer of a neural network to create a data-agnostic curriculum during the initial training epochs. More specifically, LeRaC assigns higher learning rates to neural layers closer to the input, gradually decreasing the learning rates as the layers are placed farther away from the input. This reduces the propagation of noise caused by the multiplication operations inside the network, a phenomenon that is more prevalent when the weights are randomly initialized. The learning rates increase at various paces during the first training iterations, until they all reach the same value, as illustrated in Fig. 1. From this point on, the neural model is trained as usual. This creates a model-level curriculum learning strategy that is applicable to any domain and compatible with any neural network, generating higher performance levels regardless of the architecture, without adding any extra training time. To the best of our knowledge, we are the first to employ a different learning rate per layer to achieve the same effect as conventional (data-level) curriculum learning.

Fig. 2 Convolving an image of a car with random noise filters progressively increases the level of noise in the features. A theoretical proof of this observation is given in “Appendix A”

As hinted above, the underlying hypothesis that justifies the use of LeRaC is that the level of noise grows from one neural layer to the next, especially when the input is multiplied with randomly initialized weights having low signal-to-noise ratios. We briefly illustrate this phenomenon through an example. Suppose an image x is successively convolved with a set of random filters \(c_1, c_2,\ldots , c_n\). Since the filters are uncorrelated, each filter distorts the image in a different way, degrading the information in x with each convolution. The information in x is gradually replaced by noise (see Fig. 2), i.e. the signal-to-noise ratio decreases with each layer. Optimizing the filter \(c_n\) to learn a pattern from the image convolved with \(c_1, c_2,\ldots , c_{n-1}\) is suboptimal, because the filter \(c_{n}\) will adapt to the noisy (biased) activation map induced by filters \(c_1, c_2,\ldots , c_{n-1}\). This suggests that earlier filters need to be optimized sooner to reduce the level of noise of the activation map passed to layer n. In general, this phenomenon becomes more obvious as the layers get deeper, since the number of multiplication operations grows along the way. Hence, in the initial training stages, it makes sense to use gradually lower learning rates, as the layers get farther away from the input. Our hypothesis is theoretically supported by Theorem 1, and empirically validated in “Appendix B”.
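The simulation below mimics this chain of multiplications with a sequence of element-wise products between noisy factors and reports the signal-to-noise ratio after each step. It is a minimal sketch of the phenomenon, not the protocol behind Fig. 2; the per-factor noise level and the number of steps are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100_000        # number of samples per signal
noise_std = 0.1    # per-factor SNR of 1 / 0.1**2 = 100 (20 dB)

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB, defined as the power ratio of the two components."""
    return 10 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))

clean_prod = np.ones(m)   # clean part of the accumulated product
noisy_prod = np.ones(m)   # actual accumulated product, including noise
for layer in range(1, 6):
    u = rng.choice([-1.0, 1.0], size=m)      # clean component of the current factor
    z = rng.normal(0.0, noise_std, size=m)   # noise component of the current factor
    clean_prod *= u
    noisy_prod *= u + z
    print(f"after {layer} multiplications: SNR = "
          f"{snr_db(clean_prod, noisy_prod - clean_prod):.1f} dB")
```

The printed SNR drops monotonically with the number of multiplications, consistent with the qualitative trend shown in Fig. 2.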

We conduct comprehensive experiments on 12 data sets from the computer vision (CIFAR-10 (Krizhevsky, 2009), CIFAR-100 (Krizhevsky, 2009), Tiny ImageNet (Russakovsky et al., 2015), ImageNet-1K (Russakovsky et al., 2015), Food-101 (Bossard et al. 2014), UTKFace (Zhang et al., 2017), PASCAL VOC (Everingham et al., 2010)), language (BoolQ (Clark et al., 2019), QNLI (Wang et al., 2019), RTE (Wang et al., 2019)) and audio (ESC-50 (Piczak, 2015), CREMA-D (Cao et al., 2014)) domains, considering various convolutional (ResNet-18 (He et al., 2016), Wide-ResNet-50 (Zagoruyko & Komodakis, 2016), DenseNet-121 (Huang et al., 2017), YOLOv5 (Jocher et al., 2022)), recurrent (LSTM (Hochreiter & Schmidhuber, 1997)) and transformer (CvT (Wu et al., 2021), BERT (Devlin et al., 2019), SepTr (Ristea et al., 2022)) architectures. We compare our approach with the conventional training regime and Curriculum by Smoothing (CBS) (Sinha et al., 2020), our closest competitor. Unlike CBS, our performance improvements over the standard training regime are consistent across all data sets and models. Furthermore, we significantly surpass CBS in terms of training time, since there is no additional cost over the conventional training regime for LeRaC, whereas CBS adds Gaussian smoothing layers. We also compare with several data-level and task-level curriculum learning methods (Dogan et al., 2020; Wang et al., 2023; Khan et al., 2024, 2023a, b), and show that our method scores best in most of the experiments.

In summary, our contribution is threefold:

  • We propose a novel and simple model-level curriculum learning strategy that creates a curriculum by updating the weights of each neural layer with a different learning rate, considering higher learning rates for the low-level feature layers and lower learning rates for the high-level feature layers.

  • We empirically demonstrate the applicability to multiple domains (image, audio and text), the compatibility to several neural network architectures (convolutional neural networks, recurrent neural networks and transformers), and the time efficiency (no extra training time added) of LeRaC through a comprehensive set of experiments.

  • We demonstrate our underlying hypothesis stating that the level of noise increases from one neural layer to another, both theoretically and empirically.

2 Related Work

2.1 Curriculum Learning

Curriculum learning was initially introduced by Bengio et al. (2009) as a training strategy that helps machine learning models to generalize better when the training examples are presented in the ascending order of their difficulty. Extensive surveys on curriculum learning methods, including the most recent advancements on the topic, were conducted by Soviany et al. (2022) and Wang et al. (2022). In the former survey, Soviany et al. (2022) emphasized that curriculum learning is not only applied at the data level, but also with respect to the other components involved in a machine learning approach, namely at the model level, the task level and the objective (loss) level. Regardless of the component on which curriculum learning is applied, the technique has demonstrated its effectiveness on a broad range of machine learning tasks, from computer vision (Bengio et al., 2009; Gui et al., 2017; Jiang et al., 2018; Shi & Ferrari, 2016; Soviany et al., 2021; Chen & Gupta, 2015; Sinha et al., 2020; Khan et al., 2024, 2023a, b) to natural language processing (Platanios et al., 2019; Kocmi & Bojar, 2017; Spitkovsky et al., 2009; Liu et al., 2018; Bengio et al., 2009) and audio processing (Ranjan & Hansen, 2018; Amodei et al., 2016).

The main challenge for the methods that build the curriculum at the data level is measuring the difficulty of the data samples, which is required to order the samples from easy to hard. Most studies have addressed the problem with human input (Pentina et al., 2015; Jiménez-Sánchez et al., 2019; Wei et al., 2021) or metrics based on domain-specific heuristics. For instance, the text length (Kocmi & Bojar, 2017; Cirik et al., 2016; Tay et al., 2019; Zhang et al., 2021) and the word frequency (Bengio et al., 2009; Liu et al., 2018) have been employed in natural language processing. In computer vision, the samples containing fewer and larger objects have been considered to be easier in some works (Soviany et al., 2021; Shi & Ferrari, 2016). Other solutions employed difficulty estimators (Ionescu et al., 2016) or even the confidence level of the predictions made by the neural network (Gong et al., 2016; Hacohen & Weinshall, 2019) to approximate the complexity of the data samples. Other studies (Khan et al., 2024, 2023a, b) used the error of a previously trained model to estimate the difficulty of each sample. Such solutions have shown their utility in specific application domains. Nonetheless, measuring the difficulty remains problematic when implementing standard (data-level) curriculum learning strategies, at least in some application domains. Therefore, several alternatives have emerged over time, addressing this drawback and improving the conventional curriculum learning approach. In (Kumar et al., 2010), the authors introduced self-paced learning to evaluate the learning progress when selecting training samples. The method was successfully employed in multiple settings (Kumar et al., 2010; Gong et al., 2019; Fan et al., 2017; Li et al., 2016; Zhou et al., 2018; Jiang et al., 2015; Ristea & Ionescu, 2021). Furthermore, some studies combined self-paced learning with the traditional pre-computed difficulty metrics (Jiang et al., 2015; Ma et al., 2017). An additional advancement related to self-paced learning is the approach called self-paced learning with diversity (Jiang et al., 2014). The authors demonstrated that enforcing a certain level of variety among the selected examples can improve the final performance. Another set of methods that bypass the need for predefined difficulty metrics is known as teacher-student curriculum learning (Zhang et al., 2019; Wu et al., 2018). In this setting, a teacher network learns a curriculum to supervise a student neural network.

Closer to our work, a few methods (Karras et al., 2018; Sinha et al., 2020; Burduja, 2021) proposed to apply curriculum learning at the model level, by gradually increasing the learning capacity (complexity) of the neural architecture. Such curriculum learning strategies do not need to know the difficulty of the data samples, thus having a great potential to be useful in a broad range of tasks. For example, Karras et al. (2018) proposed to gradually add layers to generative adversarial networks during training, while increasing the resolution of the input images at the same time. They are thus able to generate realistic high-resolution images. However, their approach is not applicable to every domain, since there is no notion of resolution for some input data types, e.g. text. Sinha et al. (2020) presented a strategy that blurs the activation maps of the convolutional layers using Gaussian kernel layers, reducing the noisy information caused by the network initialization. The blur level is progressively reduced to zero by decreasing the standard deviation of the Gaussian kernels. With this mechanism, they obtain a training procedure that allows the neural network to see simple information at the start of the process and more intricate details towards the end. Curriculum by Smoothing (CBS) (Sinha et al., 2020) was only shown to be useful for convolutional architectures applied in the image domain. Although we found that CBS is applicable to transformers by blurring the tokens, it is not necessarily applicable to any neural architecture, e.g. standard feed-forward neural networks. As an alternative to CBS, Burduja (2021) proposed to apply the same smoothing process on the input image instead of the activation maps. The method was applied with success in medical image alignment. However, this approach is not applicable to natural language input, as it is not clear how to apply the blurring operation on the input text.

Different from Burduja (2021) and Karras et al. (2018), our approach is applicable to various domains, including but not limited to natural language processing, as demonstrated throughout our experiments. To the best of our knowledge, the only competing model-level curriculum method which is applicable to various domains is CBS (Sinha et al., 2020). Unlike CBS, LeRaC does not introduce new operations, such as smoothing with Gaussian kernels, during training. As such, our approach does not increase the training time with respect to the conventional training regime, as later shown in the experiments included in Sect. 4.

To classify our approach as a curriculum learning framework, we consider the extreme case when the learning rate is set to zero for later layers, which is equivalent to freezing those layers. This clearly reduces the learning capacity of the model. If layers are unfrozen one by one, the capacity of the model grows. LeRaC can be seen as a soft version of the model-level curriculum method described above. We thus classify LeRaC as a model-level curriculum method. However, our method can also be seen as a curriculum learning strategy that simplifies the optimization (Pentina et al., 2015; Jiménez-Sánchez et al., 2019; Wei et al., 2021; Kocmi & Bojar, 2017; Cirik et al., 2016; Tay et al., 2019; Zhang et al., 2021; Bengio et al., 2009; Liu et al., 2018) in the early training stages by restricting the model updates (in a soft manner) to certain directions (corresponding to the weights of the earlier layers). Due to the imposed soft restrictions (lower learning rates for deeper layers), the optimization is easier at the beginning. As the training progresses, all directions become equally important, and the network is permitted to optimize the loss function in any direction. As the number of directions grows, the optimization task becomes more complex (it is harder to find the optimum). Hence, a relationship to curriculum learning can be discovered by noting that the complexity of the optimization increases over time, just as in curriculum learning.

In summary, we consider that the simplicity of our approach comes with many important advantages: applicability to any domain and task, compatibility with any neural network architecture, and time efficiency (adds no extra training time). We support all these claims through the comprehensive experiments presented in Sect. 4.

2.2 Learning Rate Schedulers

There are some contributions (Singh et al., 2015; You et al., 2017) showing that using adaptive learning rates can lead to improved results. We explain how our method is different below. In (Singh et al., 2015), the main goal is increasing the learning rate of certain layers as necessary, to escape saddle points. Different from Singh et al. (2015), our strategy reduces the learning rates of deeper layers, introducing soft optimization restrictions in the initial training epochs. You et al. (2017) proposed to train models with very large batches using a learning rate for each layer, by scaling the learning rate with respect to the norms of the gradients. The goal of You et al. (2017) is to specifically learn models with large batch sizes, e.g. formed of 8K samples. Unlike You et al. (2017), we propose a more generic approach that can be applied to multiple architectures (convolutional, recurrent, transformer) under unrestricted training settings.

Gotmare et al. (2019) point out that learning rate with warm-up and restarts is an effective strategy to improve stability of training neural models using large batches. Different from LeRaC, this approach does not employ a different learning rate for each layer. Moreover, the strategy restarts the learning rate at different moments during the entire training process, while LeRaC is applied only during the first few training epochs.

2.3 Optimizers

We consider Adam (Kingma & Ba, 2015) and related optimizers as orthogonal approaches that perform the optimization rather than setting the learning rate. Our approach, LeRaC, only aims to guide the optimization during the initial training iterations by reducing the relevance of optimizing deeper network layers. Most of the baseline architectures used in our experiments are already based on Adam or some of its variations, e.g. AdaMax, AdamW (Loshchilov & Hutter, 2019). LeRaC is applied in conjunction with these optimizers, showing improved performance over various architectures and application domains. This supports our claim that LeRaC is an orthogonal contribution to the family of Adam optimizers.

3 Method

Deep neural networks are commonly trained on a set of labeled data samples denoted as:

$$\begin{aligned} S\!=\!\{(x_i, y_i) | x_i\!\in \!X, y_i\!\in \!Y, \forall i \in \{1,2,\ldots ,m \} \}, \end{aligned}$$
(1)

where m is the number of examples, \(x_i\) is a data sample and \(y_i\) is the associated label. The training process of a neural network f with parameters \(\theta \) consists of minimizing some objective (loss) function \({\mathcal {L}}\) that quantifies the differences between the ground-truth labels and the predictions of the model f:

$$\begin{aligned} \min _{\theta } \frac{1}{m} \sum _{i=1}^m {\mathcal {L}}\left( y_i, f(x_i, \theta ) \right) . \end{aligned}$$
(2)

The optimization is generally performed by some variant of Stochastic Gradient Descent (SGD), where the gradients are back-propagated from the neural layers closer to the output towards the neural layers closer to the input through the chain rule. Let \(f_1, f_2,\ldots , f_n\) and \(\theta _1, \theta _2,\ldots , \theta _n\) denote the neural layers and the corresponding weights of the model f, such that the weights \(\theta _j\) belong to the layer \(f_j\), \(\forall j \in \{1, 2,\ldots ,n\}\). The output of the neural network for some training data sample \(x_i \in X\) is formally computed as follows:

$$\begin{aligned} {\hat{y}}_i\!=\!f (x_i, \theta )\!=\!f_n\!\left( \ldots f_2 \left( f_1 \left( x_i, \theta _1 \right) , \theta _2 \right) \ldots , \theta _n \right) \!. \end{aligned}$$
(3)

To optimize the model via SGD, the weights are updated as follows:

$$\begin{aligned} \theta _j^{(t+1)} = \theta _j^{(t)} - \eta ^{(t)} \cdot \frac{\partial {\mathcal {L}}}{\partial \theta _j^{(t)}}, \forall j \in \{1, 2,\ldots ,n\}, \end{aligned}$$
(4)

where t is the index of the current training iteration, \(\eta ^{(t)} > 0\) is the learning rate at iteration t, and the gradient of \({\mathcal {L}}\) with respect to \(\theta _j^{(t)}\) is computed via the chain rule. Before starting the training process, the weights \(\theta _j^{(0)}\) are commonly initialized with random values, e.g. using Glorot initialization (Glorot & Bengio, 2010).

Sinha et al. (2020) suggested that the random initialization of the weights produces a large amount of noise in the information propagated through the neural model during the early training iterations, which can negatively impact the learning process. Due to the feed-forward processing that involves several multiplication operations, we argue that the noise level grows with each neural layer, from \(f_j\) to \(f_{j+1}\). This statement is confirmed by the following theorem:

Theorem 1

Let \(s_1=u_1+z_1\) and \(s_2=u_2+z_2\) be two signals, where \(u_1\) and \(u_2\) are the clean components, and \(z_1\) and \(z_2\) are the noise components. The signal-to-noise ratio of the product between the two signals is lower than the signal-to-noise ratios of the two signals, i.e.:

$$\begin{aligned} {{\,\textrm{SNR}\,}}(s_1\cdot s_2) \le {{\,\textrm{SNR}\,}}(s_i), \forall i \in \{1, 2\}. \end{aligned}$$
(5)

Proof

The proof is given in “Appendix A”. \(\square \)
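As a quick numerical sanity check of Eq. (5), under the assumption of independent, zero-mean Gaussian noise components, the snippet below estimates the three SNR values from random samples; the product consistently has the lowest ratio.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 200_000

u1, z1 = rng.normal(0, 1.0, m), rng.normal(0, 0.2, m)   # SNR(s1) = 1 / 0.2**2 = 25
u2, z2 = rng.normal(0, 1.0, m), rng.normal(0, 0.5, m)   # SNR(s2) = 1 / 0.5**2 = 4

def snr(u, z):
    """Empirical signal-to-noise ratio, computed as a power ratio."""
    return np.mean(u ** 2) / np.mean(z ** 2)

# The clean component of the product is u1 * u2; everything else counts as noise.
noise_prod = (u1 + z1) * (u2 + z2) - u1 * u2
print(f"SNR(s1)    = {snr(u1, z1):.2f}")
print(f"SNR(s2)    = {snr(u2, z2):.2f}")
print(f"SNR(s1*s2) = {snr(u1 * u2, noise_prod):.2f}")   # lower than both, as stated in Eq. (5)
```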

The same issue can occur if the weights are pre-trained on a distinct task, where the misalignment of the weights with a new task is likely higher for the high-level (specialized) feature layers. To alleviate this problem, we propose to introduce a curriculum learning strategy that assigns a different learning rate \(\eta _j\) to each layer \(f_j\), as follows:

$$\begin{aligned} \theta _j^{(t+1)} = \theta _j^{(t)} - \eta _j^{(t)} \cdot \frac{\partial {\mathcal {L}}}{\partial \theta _j^{(t)}}, \forall j \in \{1, 2,\ldots ,n\}, \end{aligned}$$
(6)

such that:

$$\begin{aligned} \eta ^{(0)} \ge \eta _1^{(0)} \ge \eta _2^{(0)} \ge \cdots \ge \eta _n^{(0)}, \end{aligned}$$
(7)
$$\begin{aligned} \eta ^{(k)} = \eta _1^{(k)} = \eta _2^{(k)} = \cdots = \eta _n^{(k)}, \end{aligned}$$
(8)

where \(\eta _j^{(0)}\) are the initial learning rates and \(\eta _j^{(k)}\) are the updated learning rates at iteration k. The condition formulated in Eq. (7) indicates that the initial learning rate \(\eta _j^{(0)}\) of a neural layer \(f_j\) gets lower as the level of the respective neural layer becomes higher (farther away from the input). With each training iteration \(t \le k\), the learning rates are gradually increased, until they become equal, according to Eq. (8). Thus, our curriculum learning strategy is only applied during the early training iterations, where the noise caused by the misfit (randomly initialized or pre-trained) weights is most prevalent. Hence, k is a hyperparameter of LeRaC that is usually adjusted such that \(k\ll T\), where T is the total number of training iterations.

At this point, various schedulers can be used to increase each learning rate \(\eta _j\) from iteration 0 to iteration k. We empirically observed that an exponential scheduler is a better option than linear or logarithmic schedulers. We thus propose to employ the exponential scheduler, which is based on the following rule:

$$\begin{aligned} \eta _j^{(l)}\!=\!\eta _j^{(0)}\!\cdot \!c^{\frac{l}{k} \cdot \left( \log _c \eta _j^{(k)} - \log _c \eta _j^{(0)} \right) }\!, \forall l\!\in \!\{0,1,\ldots ,k \}. \end{aligned}$$
(9)

We set \(c=10\) in Eq. (9) across all our experiments. This is because learning rates are usually expressed as a power of \(c=10\), e.g. \(10^{-4}\). If we start with a learning rate of \(\eta _j^{(0)}=10^{-8}\) for some layer j and we want to increase it to \(\eta _j^{(k)}=10^{-4}\) during the first 5 epochs (\(k=4\)), the intermediate learning rates generated via Eq. (9) are \(\eta _j^{(1)}\!=\!10^{-7}\), \(\eta _j^{(2)}\!=\!10^{-6}\), \(\eta _j^{(3)}\!=\!10^{-5}\) and \(\eta _j^{(4)}\!=\!10^{-4}\). We thus believe it is more intuitive to understand what happens when setting \(c=10\) in Eq. (9), as opposed to using some tuned value for c. To this end, we refrain from tuning c and fix it to \(c=10\).
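The rule in Eq. (9) can be implemented in a few lines. The sketch below defines a hypothetical helper (not the authors' code) and reproduces the numerical example above for a single layer:

```python
import numpy as np

def lerac_lr(eta_0, eta_k, l, k, c=10.0):
    """Learning rate at iteration l for a layer with initial rate eta_0,
    target rate eta_k and curriculum length k, following Eq. (9)."""
    exponent = (l / k) * (np.log(eta_k) / np.log(c) - np.log(eta_0) / np.log(c))
    return eta_0 * c ** exponent

# Example from the text: raise the learning rate from 1e-8 to 1e-4 over the first 5 epochs (k = 4).
for l in range(5):
    print(f"epoch {l}: lr = {lerac_lr(1e-8, 1e-4, l, k=4):.0e}")
# Prints 1e-08, 1e-07, 1e-06, 1e-05, 1e-04.
```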

In practice, we obtain optimal results by initializing the lowest learning rate \(\eta _n^{(0)}\) with a value that is around five or six orders of magnitude lower than \(\eta ^{(0)}\), while the highest learning rate \(\eta _1^{(0)}\) is always equal to \(\eta ^{(0)}\). Apart from such general practical notes, the exact LeRaC configuration for each neural architecture is established by tuning its two hyperparameters (k, \(\eta _n^{(0)}\)) on the available validation sets.

We underline that the output feature maps of a layer j are affected (i) by the misfit weights \(\theta _j^{(0)}\) of the respective layer, and (ii) by the input feature maps, which are in turn affected by the misfit weights of the previous layers \(\theta _1^{(0)},\ldots , \theta _{j-1}^{(0)}\). Hence, the noise affecting the feature maps increases with each layer processing the feature maps, being multiplied with the weights from each layer along the way. Our curriculum learning strategy imposes the training of the earlier layers at a faster pace, transforming the noisy weights into discriminative patterns. As noise from the earlier layer weights is eliminated, we train the later layers at faster and faster paces, until all learning rates become equal at epoch k.

From a technical point of view, we note that our approach can also be regarded as a way to guide the optimization, which we see as an alternative to loss function smoothing. The link between curriculum learning and loss smoothing is discussed by Soviany et al. (2022), who suggest that curriculum learning strategies induce a smoothing of the loss function, where the smoothing is higher during the early training iterations (simplifying the optimization) and lower to non-existent during the late training iterations (restoring the complexity of the loss function). LeRaC is aimed at producing a similar effect, but in a softer manner by dampening the importance of optimizing the weights of high-level layers in the early training iterations. Additionally, we empirically observe (see results in “Appendix B”) that LeRaC tends to balance the training pace of low-level and high-level features, while the conventional regime seems to update the high-level layers at a faster pace. This could provide an additional intuitive explanation of why our method works better.

4 Experiments

4.1 Data Sets

We perform experiments on 12 benchmarks: CIFAR-10 (Krizhevsky, 2009), CIFAR-100 (Krizhevsky, 2009), Tiny ImageNet (Russakovsky et al., 2015), ImageNet-1K (Russakovsky et al., 2015), Food-101 (Bossard et al., 2014), UTKFace (Zhang et al., 2017), PASCAL VOC 2007+2012 (Everingham et al., 2010), BoolQ (Clark et al., 2019), QNLI (Wang et al., 2019), RTE (Wang et al., 2019), CREMA-D (Cao et al., 2014), and ESC-50 (Piczak, 2015). We adopt the official data splits for the 12 benchmarks considered in our experiments. When a validation set is not available, we keep \(10\%\) of the training data for validation.

CIFAR-10. CIFAR-10 (Krizhevsky, 2009) is a popular data set for object recognition in images. It consists of 60,000 color images with a resolution of \(32 \times 32\) pixels. An image depicts one of 10 object classes, each class having 6000 examples. We use the official data split with a training set of 50,000 images and a test set of 10,000 images.

CIFAR-100. The CIFAR-100 (Krizhevsky, 2009) data set is similar to CIFAR-10, except that it has 100 classes with 600 images per class. There are 50,000 training images and 10,000 test images.

Tiny ImageNet. Tiny ImageNet is a subset of ImageNet-1K (Russakovsky et al., 2015) which provides 100,000 training images, 25,000 validation images and 25,000 test images representing objects from 200 different classes. The size of each image is \(64 \times 64\) pixels.

ImageNet. ImageNet-1K (Russakovsky et al., 2015) is the most popular benchmark in computer vision, comprising about 1.2 million images from 1000 object categories. We set the resolution of all images to \(224 \times 224\) pixels.

Food-101. Food-101 (Bossard et al., 2014) is a data set that contains images from 101 food categories. For each category, there are 750 training images and 250 test images. Thus, the total number of images is 101,000. We resize all images to \(224 \times 224\) pixels. The test set is manually cleaned, while the training set is purposely left uncurated, being affected by labeling noise. This makes Food-101 suitable for testing the robustness of models to labeling noise.

UTKFace. The UTKFace data set (Zhang et al., 2017) contains face images representing various gender, age and ethnic groups. It consists of 23,709 images of \(200 \times 200\) pixels. The data set is divided into 16,597 training images, 3556 validation images, and 3556 test images. Each image is annotated with the corresponding age and gender label, which makes UTKFace suitable for evaluating models in a multi-task learning setup.

PASCAL VOC 2007+2012. One of the most popular benchmarks for object detection is PASCAL VOC (Everingham et al., 2010). The data set consists of 21,503 images which are annotated with bounding boxes for 20 object categories. The official split has 16,551 training images and 4952 test images.

BoolQ. BoolQ (Clark et al., 2019) is a question answering data set for yes/no questions containing 15,942 examples. The questions are naturally occurring, being generated in unprompted and unconstrained settings. Each example is a triplet of the form: {question, passage, answer}. We use the data split provided in the SuperGLUE benchmark (Wang et al., 2019), containing 9427 examples for training, 3270 for validation and 3245 for testing.

Table 1 Optimal hyperparameter settings for the various neural architectures used in our experiments

QNLI. The QNLI (Question-answering Natural Language Inference) data set (Wang et al., 2019) is a natural language inference benchmark automatically derived from SQuAD (Rajpurkar et al., 2016). The data set contains {question, sentence} pairs and the task is to determine whether the context sentence contains the answer to the question. The data set is constructed on top of Wikipedia documents, each document being accompanied, on average, by 4 questions. We consider the data split provided in the GLUE benchmark (Wang et al., 2019), which comprises 104,743 examples for training, 5463 for validation and 5463 for testing.

RTE. Recognizing Textual Entailment (RTE) (Wang et al., 2019) is a natural language inference data set containing pairs of sentences with the target label indicating if the meaning of one sentence can be inferred from the other. The training subset includes 2490 samples, the validation set 277 samples, and the test set 3000 samples.

CREMA-D. The CREMA-D multi-modal database (Cao et al., 2014) is formed of 7442 videos of 91 actors (48 male and 43 female) of different ethnic groups. The actors perform various emotions while uttering 12 particular sentences that evoke one of the 6 emotion categories: anger, disgust, fear, happy, neutral, and sad. Following previous work (Ristea & Ionescu, 2021), we conduct experiments only on the audio modality, dividing the set of audio samples into \(70\%\) for training, \(15\%\) for validation and \(15\%\) for testing.

ESC-50. The ESC-50 (Piczak, 2015) data set is a collection of 2000 samples of 5 s each, comprising 50 classes of various common sound events. Samples are recorded at a 44.1 kHz sampling frequency, with a single channel. In our evaluation, we employ the 5-fold cross-validation procedure, as described in related works (Piczak, 2015; Ristea et al., 2022).

4.2 Experimental Setup

Architectures. To demonstrate the compatibility of LeRaC with multiple neural architectures, we select several convolutional, recurrent and transformer models. As representative convolutional neural networks (CNNs), we opt for ResNet-18 (He et al., 2016), Wide-ResNet-50 (Zagoruyko & Komodakis, 2016) and DenseNet-121 (Huang et al., 2017). For the object detection experiments on PASCAL VOC, we use the YOLOv5 (Jocher et al., 2022) model based on the CSPDarknet53 (Wang et al., 2020) backbone, which is pre-trained on the MS COCO data set (Lin et al., 2014). As representative transformers, we consider CvT-13 (Wu et al., 2021), \(\textrm{BERT}_{{\mathrm{uncased-large}}}\) (Devlin et al., 2019) and SepTr (Ristea et al., 2022). For CvT, we consider both pre-trained and randomly initialized versions. We use an uncased large pre-trained version of BERT. Following Ristea et al. (2022), we train SepTr from scratch. In addition, we employ a long short-term memory (LSTM) network (Hochreiter & Schmidhuber, 1997) to represent recurrent neural networks (RNNs). The recurrent neural network contains two LSTM layers, each having a hidden dimension of 256 components. These layers are preceded by one embedding layer with the embedding size set to 128 elements. The output of the last recurrent layer is passed to a classifier composed of two fully connected layers. The LSTM is activated by rectified linear units (ReLU). We apply the aforementioned models on distinct input data types, considering the intended application domain of each model. Hence, ResNet-18, Wide-ResNet-50, CvT and YOLOv5 are applied on images, BERT and LSTM are applied on text, and SepTr and DenseNet-121 are applied on audio.
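Since the recurrent baseline is only described in words, we provide a minimal PyTorch sketch of it below; details such as the classifier width, the placement of the ReLU activation and the handling of padding are our assumptions, as they are not fully specified above.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Embedding (128) -> 2 x LSTM (hidden size 256) -> 2 fully connected layers."""
    def __init__(self, vocab_size, num_classes, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        # Two fully connected layers; we assume the ReLU sits between them.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, token_ids):                # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)     # (batch, seq_len, 128)
        outputs, _ = self.lstm(embedded)         # (batch, seq_len, 256)
        last_state = outputs[:, -1, :]           # output of the last recurrent step
        return self.classifier(last_state)

model = LSTMClassifier(vocab_size=20_000, num_classes=2)
logits = model(torch.randint(0, 20_000, (8, 200)))   # text inputs are capped at 200 tokens
```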

Multi-task architectures. To determine the impact of LeRaC on multi-task learning models, we conduct experiments on the UTKFace data set, where the face images are annotated with gender and age labels. We consider two models for the multi-task learning setup, namely ResNet-18 and CvT-13. Each model is jointly trained on the two tasks (gender prediction and age estimation). To each model, we attach two heads, one for gender classification and one for age estimation, respectively. The classification head is trained using the cross-entropy loss with respect to the gender label, while the regression head uses the mean squared error with respect to the age label. The models are trained using a joint objective defined as follows:

$$\begin{aligned} {\mathcal {L}}_{{\textrm{MTL}}} \!=\! \frac{1}{m} \sum _{i=1}^m {\mathcal {L}}_{{\textrm{CE}}} \left( y^{g}_i, {\hat{y}}^{g}_i \right) \!+\! \lambda \!\cdot \!{\mathcal {L}}_{{\textrm{MSE}}} \left( y^{a}_i, {\hat{y}}^{a}_i \right) , \end{aligned}$$
(10)

where \(y^{g}_i\) and \(y^{a}_i\) are the ground-truth gender and age labels, \({\hat{y}}^{g}_i\) and \({\hat{y}}^{a}_i\) are the predicted gender and age labels, \(\lambda \in {\mathbb {R}}^+\) is a weight factor, and \({\mathcal {L}}_{{\textrm{CE}}}\) is the cross-entropy loss for the gender prediction task, defined as:

$$\begin{aligned} {\mathcal {L}}_{{\textrm{CE}}}\! \left( y^{g}_i, {\hat{y}}^{g}_i \right) \!=\!-\left( y^g_i \log ({\hat{y}}^g_i)\!+\!(1\!-\!y^g_i) \log (1 - {\hat{y}}^g_i) \right) , \end{aligned}$$
(11)

and \({\mathcal {L}}_{{\textrm{MSE}}}\) is the mean squared error for the age estimation task, defined as:

$$\begin{aligned} {\mathcal {L}}_{{\textrm{MSE}}} \left( y^{a}_i, {\hat{y}}^{a}_i \right) = (y^{a}_i - {\hat{y}}^{a}_i)^2. \end{aligned}$$
(12)

The factor \(\lambda \) ensures the two tasks are equally important by weighting \({\mathcal {L}}_{{\textrm{MSE}}}\) to have approximately the same range of values as \({\mathcal {L}}_{{\textrm{CE}}}\). As such, we set \(\lambda =10\).
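For clarity, a minimal PyTorch sketch of the joint objective in Eqs. (10)-(12) is given below; the tensor shapes and the use of the logit-based cross-entropy are our own illustrative choices.

```python
import torch
import torch.nn.functional as F

def multi_task_loss(gender_logits, age_pred, gender_labels, age_labels, lam=10.0):
    """Joint objective from Eq. (10): binary cross-entropy for gender (Eq. 11)
    plus a lambda-weighted mean squared error for age (Eq. 12)."""
    # binary_cross_entropy_with_logits is a numerically stable form of Eq. (11).
    loss_ce = F.binary_cross_entropy_with_logits(gender_logits, gender_labels)
    loss_mse = F.mse_loss(age_pred, age_labels)   # Eq. (12), averaged over the batch
    return loss_ce + lam * loss_mse

# Toy batch of 4 samples, one output per head.
loss = multi_task_loss(gender_logits=torch.randn(4),
                       age_pred=torch.randn(4),
                       gender_labels=torch.tensor([0., 1., 1., 0.]),
                       age_labels=torch.tensor([23., 41., 35., 60.]))
```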

Baselines. We compare LeRaC with two baselines: the conventional training regime (which uses early stopping, reduces the learning rate on plateau, and employs linear warm-up and cosine annealing when required) and the state-of-the-art Curriculum by Smoothing (Sinha et al., 2020). For CBS, we use the official code released by Sinha et al. (2020) at https://github.com/pairlab/CBS, to ensure the reproducibility of their method in our experimental settings, which include a more diverse selection of input data types and neural architectures. In addition, we compare with several data-level and task-level curriculum learning methods (Dogan et al., 2020; Wang et al., 2023; Khan et al., 2023a, b, 2024) on CIFAR-10 and CIFAR-100.

To apply CBS to non-convolutional architectures, we use 1D convolutional layers based on Gaussian filters with a receptive field of 3. For transformers, we integrate a 1D Gaussian layer before each transformer block, so the smoothing is applied on the sequence of tokens. Similarly, for recurrent neural networks, before each LSTM layer, we process the sequence of tokens with 1D convolutional layers based on Gaussian filters. For both transformers and RNNs, we anneal, during training, the standard deviation of the Gaussian filters to enhance the information propagated through the network. This approach mirrors the implementation of CBS for convolutional neural networks.
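For illustration, a sketch of such a 1D Gaussian smoothing layer applied to a token sequence is given below, assuming a kernel with a receptive field of 3 whose standard deviation is annealed from outside the function; this is our own rendering of the adaptation, not the code of Sinha et al. (2020).

```python
import torch
import torch.nn.functional as F

def gaussian_smooth_tokens(tokens, sigma):
    """Smooth a token sequence (batch, seq_len, dim) with a 1D Gaussian kernel of size 3,
    applied depthwise (one identical kernel per embedding dimension)."""
    offsets = torch.tensor([-1.0, 0.0, 1.0])
    kernel = torch.exp(-offsets ** 2 / (2 * sigma ** 2))
    kernel = kernel / kernel.sum()                      # normalize to preserve the scale
    dim = tokens.shape[-1]
    kernel = kernel.view(1, 1, 3).repeat(dim, 1, 1)     # (dim, 1, 3): one kernel per channel
    x = tokens.transpose(1, 2)                          # (batch, dim, seq_len)
    x = F.conv1d(x, kernel, padding=1, groups=dim)      # depthwise smoothing along the sequence
    return x.transpose(1, 2)

tokens = torch.randn(2, 16, 64)
smoothed = gaussian_smooth_tokens(tokens, sigma=1.0)    # sigma is decayed during training
```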

Hyperparameter tuning. We tune all hyperparameters on the validation set of each benchmark. In Table 1, we present the optimal hyperparameters chosen for each architecture. In addition to the standard parameters of the training process, we report the parameters that are specific for the CBS (Sinha et al., 2020) and LeRaC strategies. In the case of CBS, \(\sigma \) denotes the standard deviation of the Gaussian kernel, d is the decay rate for \(\sigma \), and u is the decay step. Regarding the parameters of LeRaC, k represents the number of iterations used in Eq. (9), and \(\eta _1^{(0)}\) and \(\eta _n^{(0)}\) are the initial learning rates for the first and last layers of the architecture, respectively. We set \(\eta _1^{(0)} = \eta ^{(0)}\) and \(c= 10\) in all experiments, without tuning. In addition, the intermediate learning rates \(\eta _j^{(0)}\), \(\forall j \in \{2, 3,\ldots ,n-1\}\), are automatically set to be equally distanced between \(\eta _1^{(0)}\) and \(\eta _n^{(0)}\). Moreover, \(\eta _j^{(k)} = \eta ^{(0)}\), i.e. the initial learning rates of LeRaC converge to the original learning rate set for the conventional training regime. All models are trained with early stopping and the learning rate is reduced by a factor of 10 when the loss reaches a plateau. We use linear warm-up with cosine annealing, whenever it is found useful for models based on conventional or CBS training. The learning rate warm-up is switched off for LeRaC to avoid unwanted interactions with our training strategy. Except for the pre-trained models, the weights of all models are initialized with Glorot initialization (Glorot & Bengio, 2010).

We underline that some parameters are the same across all data sets, while others need to be established per data set. For example, the parameter u of CBS and the parameter k of LeRaC are validated on each data set. As such, for the ResNet-18 model, the parameter u of CBS takes one value on each data set (CIFAR-10, CIFAR-100, Tiny ImageNet, ImageNet, Food-101, UTKFace), but the values of u across these data sets can range between 2 and 5. Similarly, the parameter k of LeRaC takes one value per data set, with the range of values being 5–7. In Table 1, we aggregate the optimal parameters of each model for all data sets. This explains why some hyperparameters are specified in terms of ranges.

Setting the initial learning rates. We should emphasize that the different learning rates \(\eta _j^{(0)}\), \(\forall j \in \{1,2,\ldots ,n \}\), are not optimized nor tuned during training. Instead, we set the initial learning rates \(\eta _j^{(0)}\) through validation, such that \(\eta _n^{(0)}\) is around five or six orders of magnitude lower than \(\eta ^{(0)}\), and \(\eta _1^{(0)}=\eta ^{(0)}\). After initialization, we apply our exponential scheduler, until all learning rates become equal at iteration k. In addition, we would like to underline that the difference \(\delta \) between the initial learning rates of consecutive layers is automatically set based on the range given by \(\eta _1^{(0)}\) and \(\eta _n^{(0)}\). For example, let us consider a network with 5 layers. If we choose \(\eta _1^{(0)}=10^{-1}\) and \(\eta _5^{(0)}=10^{-2}\), then the intermediate initial learning rates are automatically set to \(\eta _2^{(0)}=10^{-1.25}\), \(\eta _3^{(0)}=10^{-1.5}\), \(\eta _4^{(0)}=10^{-1.75}\), i.e. \(\delta \) is used in the exponent and is equal to \(-0.25\) in this case. To obtain the intermediate learning rates according to this example, we actually apply the exponential scheduler defined in Eq. (9). This reduces the number of tunable hyperparameters from n (the number of layers) to two, namely \(\eta _1^{(0)}\) and \(\eta _n^{(0)}\). We go even further, setting \(\eta _1^{(0)}=\eta ^{(0)}\) without tuning, in all our experiments. Hence, tuning is only performed for the initial learning rate of the last layer, namely \(\eta _n^{(0)}\). Although tuning all \(\eta _j^{(0)}\), \(\forall j \in \{1, 2,\ldots ,n\}\), might lead to better results, we refrain from meticulously tuning every possible value to avoid overfitting in hyperparameter space.
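In other words, the initial per-layer learning rates are obtained by geometric interpolation between \(\eta _1^{(0)}\) and \(\eta _n^{(0)}\). The sketch below reproduces the five-layer example and attaches the resulting rates to a model through one optimizer parameter group per layer; the parameter-group mechanism is our assumption about one possible implementation, not necessarily the authors' code.

```python
import numpy as np
import torch

def initial_layer_lrs(eta_first, eta_last, num_layers):
    """Geometrically interpolate the initial learning rates between the first and last layer."""
    exponents = np.linspace(np.log10(eta_first), np.log10(eta_last), num_layers)
    return [10.0 ** e for e in exponents]

lrs = initial_layer_lrs(1e-1, 1e-2, num_layers=5)
print([f"{lr:.4e}" for lr in lrs])   # 10**-1, 10**-1.25, 10**-1.5, 10**-1.75, 10**-2

# One parameter group per layer; each group's lr is then raised over time following Eq. (9).
model = torch.nn.Sequential(*[torch.nn.Linear(8, 8) for _ in range(5)])
param_groups = [{"params": layer.parameters(), "lr": lr} for layer, lr in zip(model, lrs)]
optimizer = torch.optim.SGD(param_groups, lr=lrs[0], momentum=0.9)
```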

Number of hyperparameters. We further emphasize that LeRaC adds only two additional tunable hyperparameters with respect to the conventional training regime. These are the lowest learning rate \(\eta _n^{(0)}\) and the number of iterations k to employ LeRaC. We reduce the number of hyperparameters that require tuning by using a fixed rule to adjust the intermediate learning rates, e.g. by employing an exponential scheduler, or by fixing some hyperparameters, e.g. \(c=10\). In contrast, CBS (Sinha et al., 2020) has three additional hyperparameters, thus having one extra hyperparameter compared with LeRaC. Furthermore, we note that data-level curriculum methods also introduce additional hyperparameters. Even a simple method that splits the examples into easy-to-hard batches that are gradually added to the training set requires at least two hyperparameters: the number of batches, and the number of iterations before introducing a new training batch. We thus believe that, in terms of the number of additional hyperparameters, LeRaC is comparable to CBS and other curriculum learning strategies. We emphasize that the same happens if we look at new optimizers, e.g. Adam (Kingma & Ba, 2015) adds three additional hyperparameters compared with SGD.

Table 2 Average accuracy rates (in %) over 5 runs on CIFAR-10, CIFAR-100 and Tiny ImageNet for various neural models based on different training regimes: learning rate decay, linear warm-up, cosine annealing, constant learning rate, and LeRaC

Avoiding too large learning rates. In principle, a larger learning rate implies a larger update. However, if the learning rate is too high, the model can actually diverge. This is because the gradient describes the loss function in the vicinity of the current location, providing no guarantee for the value of the loss outside this vicinity. Our implementation takes this aspect into account. Instead of increasing the learning rates of the earlier layers, we reduce the learning rates of the deeper layers to avoid divergence. More precisely, we set the learning rate for the first layer \(\eta _1^{(0)}\) to the original learning rate \(\eta ^{(0)}\) and the other initial learning rates are gradually reduced with each layer. During training, the lower learning rates are gradually increased, until epoch k. Hence, LeRaC actually slows down the learning for deeper layers, until the earlier layers have learned representative features.

Evaluation. For the classification tasks, we evaluate all models in terms of the accuracy rate. For the regression task (age estimation), we use the mean absolute error. For the object detection task, we employ the mean Average Precision (mAP) at an intersection over union (IoU) threshold of 0.5. We repeat the training process of each model for 5 times and report the average performance and the standard deviation.

Table 3 Average accuracy rates (in %) over 5 runs on CIFAR-10, CIFAR-100, Tiny ImageNet, ImageNet and Food-101 for various neural models based on different training regimes: conventional, CBS (Sinha et al., 2020) and LeRaC

4.3 Domain-Specific Preprocessing

Image preprocessing. For the image classification experiments, we apply the same data preprocessing approach as Sinha et al. (2020). Hence, we normalize the images and maintain their original resolution, \(32 \times 32\) pixels for CIFAR-10 and CIFAR-100, \(64 \times 64\) pixels for Tiny ImageNet, \(224 \times 224\) pixels for ImageNet and Food-101, and \(200 \times 200\) pixels for UTKFace. Similar to Sinha et al. (2020), we do not employ data augmentation.

Text preprocessing. For the text classification experiments with BERT, we lowercase all words and add the classification token ([CLS]) at the start of the input sequence. We add the separator token ([SEP]) to delimit sentences. For the LSTM network, we lowercase all words and replace them with indexes from vocabularies constructed from the training set. The input sequence length is limited to 512 tokens for BERT and 200 tokens for LSTM.

Speech preprocessing. The speech preprocessing steps are carried out following Ristea et al. (2022). We thus transform each audio sample into a time-frequency matrix by computing the discrete Short Time Fourier Transform (STFT) with \(N_x\) FFT points, using a Hamming window of length L and a hop size R. For CREMA-D, we first standardize all audio clips to a fixed dimension of 4 seconds by padding or clipping the samples. Then, we apply the STFT with \(N_x=1024\), \(R=64\) and a window size of \(L=512\). For ESC-50, we keep the same values for \(N_x\) and L, but we increase the hop size to \(R = 128\). Next, for each STFT, we compute the square root of the magnitude and map the values to 128 Mel bins. The result is converted to a logarithmic scale and normalized to the interval [0, 1]. Furthermore, in all our speech classification experiments, we use the following data augmentation methods: noise perturbation, time shifting, speed perturbation, mix-up and SpecAugment (Park et al., 2019).
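A sketch of this pipeline for a single CREMA-D clip is given below, using librosa; the exact padding strategy, the choice of log transform (we use log1p) and the sampling rate handling are our assumptions, and the file path is hypothetical.

```python
import numpy as np
import librosa

def preprocess_clip(path, target_seconds=4, n_fft=1024, hop=64, win=512, n_mels=128):
    """STFT -> square root of the magnitude -> 128 Mel bins -> log scale -> [0, 1] normalization."""
    audio, sr = librosa.load(path, sr=None)
    target_len = int(target_seconds * sr)
    # Standardize the clip to a fixed duration by clipping or zero-padding.
    if len(audio) >= target_len:
        audio = audio[:target_len]
    else:
        audio = np.pad(audio, (0, target_len - len(audio)))
    stft = librosa.stft(audio, n_fft=n_fft, hop_length=hop, win_length=win, window="hamming")
    magnitude = np.sqrt(np.abs(stft))
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_spec = mel_basis @ magnitude
    log_spec = np.log1p(mel_spec)
    return (log_spec - log_spec.min()) / (log_spec.max() - log_spec.min() + 1e-8)

features = preprocess_clip("crema_d_sample.wav")   # hypothetical file path
```

For ESC-50, the hop size would be set to 128, matching the larger hop size mentioned above.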

Table 4 Multi-task learning results for ResNet-18 and CvT-13 (pre-trained) on UTKFace, using three different training regimes: conventional, CBS (Sinha et al., 2020) and LeRaC
Table 5 Object detection results of YOLOv5 on PASCAL VOC, using three different training regimes: conventional, CBS (Sinha et al., 2020) and LeRaC
Table 6 Left side: average accuracy rates (in %) over 5 runs on BoolQ, RTE and QNLI for BERT and LSTM
Fig. 3 Validation accuracy (on the y-axis) versus training time (on the x-axis) for four distinct architectures. The number of training epochs is the same for both LeRaC and CBS, the observable time difference being caused by the overhead of CBS due to the Gaussian kernel layers

4.4 Preliminary Results

We present preliminary experiments to show the effect of various learning rate schedulers for different architectures. For each architecture, we compare the constant learning rate scheduler with an adaptive learning rate scheduler. The aim is to find the best scheduler for the conventional training regime, which is used as baseline in the subsequent experiments. Table 2 showcases the preliminary results on CIFAR-10, CIFAR-100 and Tiny ImageNet. We compare the outcomes of the adaptive and constant learning rate schedulers with those of LeRaC. In most cases, the adaptive scheduler yields better results than the constant learning rate. Using a constant learning rate seems to work only for the pre-trained CvT-13. Notably, the analysis also reveals that LeRaC consistently outperforms the other baseline schedulers, achieving the highest accuracy rates across all data sets.

We emphasize that, for the subsequent experiments, the conventional regime is always represented by the best scheduler among the following options: learning rate decay, learning rate warm-up, cosine annealing, or combinations of the aforementioned options.

4.5 Main Results

Image classification. In Table 3, we present the image classification results on CIFAR-10, CIFAR-100, Tiny ImageNet, ImageNet and Food-101. Since CvT-13 is pre-trained on ImageNet, it does not make sense to fine-tune it on ImageNet. Thus, the respective results are not reported. On the one hand, there are two scenarios (ResNet-18 on CIFAR-100, and CvT-13 on CIFAR-100) in which CBS provides the largest improvements over the conventional regime, surpassing LeRaC in the respective cases. On the other hand, there are more than 10 scenarios where CBS degrades the accuracy with respect to the standard training regime. This shows that the improvements attained by CBS are inconsistent across models and data sets. Unlike CBS, our strategy surpasses the baseline regime in all 19 cases, thus being more consistent. In 8 of these cases, the accuracy gains of LeRaC are higher than \(1\%\). Moreover, LeRaC outperforms CBS in 17 out of 19 cases. We thus consider that LeRaC can be regarded as a better choice than CBS, bringing consistent performance gains.

Multi-task learning. In Table 4, we include the multi-task learning results on the UTKFace data set (Zhang et al., 2017). We evaluate the multi-task ResNet-18 and \(\text{ CvT-13}_{{\text{ pre-trained }}}\) models under various training regimes, reporting the accuracy rates for gender prediction, and the mean absolute errors for age estimation, respectively. LeRaC achieves the best scores in each and every case, surpassing the other training regimes in the multi-task learning setup. Moreover, its results are statistically better with respect to both competing regimes. In contrast, the CBS regime remains in the statistical margin of the conventional regime for the pre-trained CvT-13 network.

Object detection. In Table 5, we include the object detection results of YOLOv5 (Jocher et al., 2022) based on different training regimes on PASCAL VOC 2007+2012 (Everingham et al., 2010). LeRaC exhibits a superior mAP score, significantly surpassing the other training regimes. In contrast, CBS leads to suboptimal performance, hinting towards the inconsistency of CBS across different evaluation scenarios.

Text classification. In Table 6 (left side), we report the text classification results on BoolQ, RTE and QNLI. Here, there are two cases (BERT on QNLI and LSTM on RTE) where CBS leads to performance drops compared with the conventional training regime. In all other cases, the improvements of CBS are below \(0.6\%\). Just as in the image classification experiments, LeRaC brings accuracy gains for each and every model and data set. In four out of six scenarios, the accuracy gains yielded by LeRaC are higher than \(1.3\%\). Once again, LeRaC proves to be the most consistent regime, generally surpassing CBS by significant margins.

Speech classification. In Table 6 (right side), we present the results obtained on the audio data sets, namely CREMA-D and ESC-50. We observe that the CBS strategy obtains lower results compared with the baseline in two cases (SepTr on CREMA-D and DenseNet-121 on ESC-50), while our method provides superior results for each and every case. By applying LeRaC on SepTr, we set a new state-of-the-art accuracy level (\(70.95\%\)) on the CREMA-D audio modality, surpassing the previous state-of-the-art value attained by Ristea et al. (2022) with SepTr alone. When applied on DenseNet-121, LeRaC brings performance improvements higher than \(1\%\), the highest improvement (\(1.78\%\)) over the baseline being attained on CREMA-D.

Significance testing. To determine if the reported accuracy gains observed for LeRaC with respect to the baseline are significant, we apply McNemar/Cochran Q significance testing (Dietterich, 1998) to the results reported in Tables 3, 4, 5 and 6 on all 12 data sets. In 27 of 34 cases, we found that our results are significantly better than the corresponding baseline, at a p-value of 0.001. This confirms that our gains are statistically significant in the majority of cases.
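As an illustration of the procedure, the snippet below runs a McNemar test on a 2x2 contingency table of paired per-sample outcomes, using statsmodels; the counts are invented for the example and do not correspond to any table in the paper.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Rows: baseline correct / wrong; columns: LeRaC correct / wrong (hypothetical counts).
table = np.array([[8500, 150],
                  [450, 900]])

# Chi-square version of McNemar's test, appropriate for large discordant counts.
result = mcnemar(table, exact=False, correction=True)
print(f"statistic = {result.statistic:.2f}, p-value = {result.pvalue:.2e}")
```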

Table 7 Average accuracy rates (in %) over 5 runs for ResNet-18, Wide-ResNet-50 and CvT-13 (pre-trained) on CIFAR-10 and CIFAR-100 using different training regimes: conventional, CBS (Sinha et al., 2020), LSCL (Dogan et al., 2020), EfficientTrain (Wang et al., 2023), Self-taught (Khan et al., 2024), CLIP (Khan et al., 2023a), LCDnet-CL (Khan et al., 2023b) and LeRaC (ours)

Training time comparison. For a particular model and data set, all training regimes are executed for the same number of epochs, for a fair comparison. However, the CBS strategy adds the smoothing operation at multiple levels inside the architecture, which increases the training time. To this end, we compare the training time (in hours) versus the validation accuracy of CBS and LeRaC. For this experiment, we selected four neural models and illustrate the evolution of the validation accuracy over time in Fig. 3. We underline that LeRaC introduces faster convergence times, being around 7-12% faster than CBS. Naturally, LeRaC requires the same training time as the conventional regime.

4.6 More Comparative Results

Comparing with domain-specific curriculum learning strategies. Although we consider CBS (Sinha et al., 2020) as our closest competitor in terms of applicability across architectures and domains, there are domain-specific curriculum learning methods reporting promising results. To this end, we perform additional experiments on CIFAR-10 and CIFAR-100 with ResNet-18, Wide-ResNet-50 and CvT-13 (pre-trained), considering two recent curriculum learning strategies applied in the image domain, namely Label-Similarity Curriculum Learning (LSCL) (Dogan et al., 2020) and EfficientTrain (Wang et al., 2023).

Table 8 Average accuracy rates (in %) over 5 runs on CIFAR-10, CIFAR-100 and Tiny ImageNet for CvT-13 based on different training regimes: conventional, LeRaC with logarithmic update, LeRaC with linear update, and LeRaC with exponential update (proposed)

Dogan et al. (2020) proposed LSCL, a strategy that relies on hierarchically clustering the classes (labels) based on inter-label similarities determined with the help of document embeddings representing the Wikipedia pages of the respective classes. The corresponding results shown in Table 7 indicate that label-similarity curriculum is useful for CIFAR-100, but not for CIFAR-10. This suggests that the method needs a sufficiently large number of classes to benefit from the constructed hierarchy of classes. In contrast, LeRaC does not rely on external components, such as the similarity measure used by Dogan et al. (2020) in their strategy. Another important limitation of LSCL (Dogan et al., 2020) is its restricted use, e.g. LSCL is not applicable to regression tasks, where there are no classes. Therefore, we consider LeRaC as a more versatile alternative.

EfficientTrain is an alternative to CBS, which introduces a cropping operation in the Fourier spectrum of the inputs instead of blurring the activation maps. The method is not suitable for text data, so the comparison between EfficientTrain and LeRaC can only be performed in the image domain. Consequently, we compare with EfficientTrain (Wang et al., 2023) on CIFAR-10 and CIFAR-100, and show the corresponding results in Table 7. Notably, our method surpasses EfficientTrain (Wang et al., 2023) in 4 out of 6 evaluation scenarios. These results confirm the competitiveness of LeRaC in comparison to very recent methods, such as EfficientTrain (Wang et al., 2023).

Aside from outperforming EfficientTrain and LSCL in the image domain, our method has another important advantage: it is generally applicable to any domain.

Comparing with data-level curriculum learning methods. In Table 7, we also compare LeRaC with three data-level curriculum learning methods (Khan et al., 2024, 2023a, b). These methods share a common framework, where a scoring function ranks samples based on their difficulty, and a pacing function determines the timing for introducing new batches during training. Khan et al. (2024) examine various pacing functions and classify scoring functions into two categories: self-taught and transfer-scoring functions. Self-taught functions involve training a model on a subset of data batches and then using this model to assess the difficulty of examples. In contrast, transfer-scoring functions utilize a pre-trained model for this purpose. For the results reported in Table 7 for Khan et al. (2024), we use the self-taught scoring function and a linear pacing function. To compare with Khan et al. (2023b), we use a transfer-scoring function and a ResNet-50 model pre-trained on ImageNet. For Khan et al. (2023a), aside from using the pre-trained model for assessing the difficulty of the samples, we also remove the least significant samples during training.

The results reported in Table 7 indicate that LeRaC outperforms the data-level curriculum learning methods. We note that these methods were exclusively tested on crowd density estimation tasks, which may explain why their effectiveness does not generalize to other types of tasks. For instance, the method described by Khan et al. (2023a) is suboptimal even when compared with conventional training, suggesting that the strategy of removing easy examples is not always effective for image classification tasks.

Fig. 4 Average SNR of the feature maps at each layer of the randomly initialized LeNet architecture. The SNR at each layer is averaged for 100 randomly picked images from the CIFAR-100 data set. For the later layers, the SNR is negative because the signal is dominated by noise

4.7 Ablation Studies

Comparing different schedulers. We first aim to establish whether the exponential learning rate scheduler proposed in Eq. (9) is a good choice. To test this, we select the CvT-13 model and modify the LeRaC regime to use linear or logarithmic updates of the learning rates. The corresponding results are shown in Table 8. We observe that both alternative schedulers obtain performance gains, but our exponential learning rate scheduler brings higher gains on all three data sets. We thus conclude that the update rule defined in Eq. (9) is a sound option.
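To make the comparison concrete, the sketch below shows one plausible way to implement the three update rules, interpolating each layer's learning rate from its initial value \(\eta _j^{(0)}\) towards the common target value over the first k iterations. The exact parameterization of Eq. (9) may differ, so this should be read as an illustration rather than a reference implementation.

```python
import numpy as np

def lerac_lr(eta0_j, eta_target, i, k, mode="exp"):
    # Illustrative per-layer learning rate at iteration i (0 <= i <= k), where
    # eta0_j is the initial learning rate of layer j (lower for deeper layers)
    # and eta_target is the common value reached by all layers at iteration k.
    t = min(i / k, 1.0)  # training progress in [0, 1]
    if mode == "linear":  # linear interpolation
        return eta0_j + t * (eta_target - eta0_j)
    if mode == "log":  # logarithmic pace: fast growth early, slower later
        return eta0_j + np.log1p(t * (np.e - 1)) * (eta_target - eta0_j)
    # exponential pace: linear interpolation in log-space
    return eta0_j * (eta_target / eta0_j) ** t

# Example: a layer starting at 1e-5 that reaches 1e-3 after k = 5 iterations.
print([round(lerac_lr(1e-5, 1e-3, i, 5), 6) for i in range(6)])
```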

Our previous ablation study shows that the exponential scheduler leads to higher gains than the linear or the logarithmic schedulers. In general, a suitable scheduler is one that adjusts the learning rate at each layer proportionally to the estimated drop in signal-to-noise ratio (SNR) from one layer to the next. To understand how the average SNR drops from one neural layer to the next, we plot the average SNR of the feature maps at each layer of the randomly initialized LeNet architecture, computed over 100 images from CIFAR-100, in Fig. 4. As anticipated, the average SNR decreases along with the layer index. Notably, we observe that the drop in SNR follows an exponential trend. This can explain why the exponential scheduler is a more suitable choice.
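As an indication of how such a measurement can be reproduced, the sketch below estimates a per-layer SNR on a randomly initialized LeNet-style network, under the assumption that the signal is the response to a clean input and the noise is the response deviation caused by a small input perturbation. The exact SNR estimator and layer configuration behind Fig. 4 may differ, so the snippet should only be taken as a rough illustration of the measured trend.

```python
import torch
import torch.nn as nn

# A LeNet-style stack for 32x32 RGB inputs (layer sizes are illustrative).
layers = nn.ModuleList([
    nn.Sequential(nn.Conv2d(3, 6, 5), nn.Tanh(), nn.AvgPool2d(2)),
    nn.Sequential(nn.Conv2d(6, 16, 5), nn.Tanh(), nn.AvgPool2d(2)),
    nn.Sequential(nn.Flatten(), nn.Linear(16 * 5 * 5, 120), nn.Tanh()),
    nn.Sequential(nn.Linear(120, 84), nn.Tanh()),
    nn.Linear(84, 100),
])

@torch.no_grad()
def layerwise_snr_db(images, noise_std=0.1):
    # Average SNR (in dB) of the feature maps at each layer, where the "noise"
    # is the deviation caused by perturbing the input of the random network.
    clean, noisy = images, images + noise_std * torch.randn_like(images)
    snrs = []
    for layer in layers:
        clean, noisy = layer(clean), layer(noisy)
        signal_power = clean.pow(2).mean()
        noise_power = (noisy - clean).pow(2).mean()
        snrs.append(10 * torch.log10(signal_power / noise_power).item())
    return snrs

# E.g., 100 random tensors standing in for CIFAR-100 images:
print(layerwise_snr_db(torch.rand(100, 3, 32, 32)))
```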

Fig. 5 Test accuracy (on the y-axis) versus training time (on the x-axis) for ResNet-18 on CIFAR-100 with various curriculum schedulers. The dashed line corresponds to the conventional regime, while the continuous lines correspond to LeRaC with various schedulers. Best viewed in color

Fig. 6 Test accuracy (on the y-axis) versus training time (on the x-axis) for the pre-trained CvT-13 on CIFAR-10 with various curriculum schedulers. The dashed line corresponds to the conventional regime, while the continuous lines correspond to LeRaC with various schedulers. Best viewed in color

To further justify our preference for the exponential scheduler, we analyze the training progress of the ResNet-18 and the pre-trained CvT-13 models using the logarithmic, linear and exponential schedulers for LeRaC. Figure 5 shows the results for ResNet-18, while Fig. 6 illustrates the results for CvT-13. In both cases, the exponential scheduler leads to better training progress than the conventional regime, while the linear and logarithmic schedulers are less effective. These results further confirm that the exponential scheduler is the best choice among the considered options.

Table 9 Average accuracy rates (in %) over 5 runs for ResNet-18 and Wide-ResNet-50 on CIFAR-100 based on different ranges for the initial learning rates

Varying value ranges for initial learning rates. All our hyperparameters are either fixed without tuning or tuned on the validation data. In this ablation experiment, we present results with LeRaC using multiple ranges for \(\eta _1^{(0)}\) and \(\eta _n^{(0)}\) to demonstrate that LeRaC is sufficiently stable with respect to suboptimal hyperparameter choices. We carry out experiments with ResNet-18 and Wide-ResNet-50 on CIFAR-100 and report the corresponding results in Table 9. We observe that all hyperparameter configurations surpass the baseline regime. This indicates that LeRaC can bring performance gains even outside the optimal learning rate bounds, demonstrating low sensitivity to suboptimal hyperparameter tuning.

Table 10 Average accuracy rates (in %) over 5 runs for ResNet-18 and Wide-ResNet-50 on CIFAR-100 using the LeRaC regime until iteration k, while varying k
Table 11 Average accuracy rates (in %) over 5 runs for ResNet-18 and Wide-ResNet-50 on CIFAR-100, as well as SepTr on CREMA-D, based on different training regimes: conventional, anti-LeRaC and LeRaC

Varying \({\textbf{k}}\). In Table 10, we present additional results with ResNet-18 and Wide-ResNet-50 on CIFAR-100, considering various values for k (the last iteration for our training regime). We observe that all configurations surpass the baselines on CIFAR-100. Moreover, the optimal value for k obtained on the validation set (\(k=7\) for both ResNet-18 and Wide-ResNet-50) is not the value producing the best results on the test set. This confirms that we did not overfit the hyperparameters of LeRaC.

Anti-curriculum. Since our goal is to perform curriculum learning (from easy to hard), we restrict the settings for \(\eta _j\), \(\forall j \in \{1,2,\ldots ,n \}\), such that deeper layers start with lower learning rates. However, another strategy is to consider the opposite setting, where we use higher learning rates for deeper layers. If we train the later layers at a faster pace (anti-curriculum), we conjecture that the later layers adapt to the noise coming from the early layers, which is likely to lead to local optima or difficult training (due to the need to readapt to the earlier layers once these layers start learning useful features). We tested this approach (anti-LeRaC), which belongs to the category of anti-curriculum learning strategies (Soviany et al., 2022), in a set of new experiments with ResNet-18 and Wide-ResNet-50 on CIFAR-100, as well as SepTr on CREMA-D. We report the corresponding results with LeRaC and anti-LeRaC in Table 11. Although anti-curriculum, e.g. hard negative sample mining, was shown to be useful in other tasks (Soviany et al., 2022), our results indicate that learning rate anti-curriculum attains inferior performance compared with our approach. Furthermore, anti-LeRaC also falls below the conventional regime, confirming our conjecture regarding this strategy.

Summary. Notably, our ablation results show that the majority of hyperparameter configurations tested for LeRaC outperform the conventional regime, demonstrating the stability of LeRaC. We present additional experiments in “Appendix C”.

5 Discussion

Interaction with optimization algorithms. Throughout our experiments, we use the same optimizer for a given neural model across all training regimes (conventional, CBS, LeRaC). The best optimizer for each neural model is established under the conventional training regime. We underline that our initial learning rates and scheduler are used independently of the optimizers. Although our learning rate scheduler updates the learning rates at the beginning of every iteration, we did not observe any stability or interaction issues with any of the optimizers (SGD, Adam, AdaMax, AdamW).
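In practice, the per-layer learning rates can be exposed to any such optimizer through parameter groups, as sketched below in PyTorch. The layer list, the initial learning rates and the exponential pace are illustrative values rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

model = nn.Sequential(  # stand-in network; the layer list is illustrative
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

# One parameter group per parametric layer, with deeper layers starting lower.
layers = [m for m in model if any(p.requires_grad for p in m.parameters(recurse=False))]
initial_lrs = [1e-3, 1e-4, 1e-5]  # illustrative values for eta_j^(0), j = 1..n
target_lr, k = 1e-3, 5            # common learning rate and last LeRaC iteration

optimizer = torch.optim.Adam(
    [{"params": layer.parameters(), "lr": lr} for layer, lr in zip(layers, initial_lrs)]
)

def lerac_step(iteration):
    # Raise each group's learning rate exponentially towards target_lr until iteration k.
    t = min(iteration / k, 1.0)
    for group, lr0 in zip(optimizer.param_groups, initial_lrs):
        group["lr"] = lr0 * (target_lr / lr0) ** t
```

Calling lerac_step at the beginning of every iteration reproduces the behavior described above, regardless of whether SGD, Adam, AdaMax or AdamW is plugged in.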

Interaction with other curriculum learning strategies. Our simple and generic curriculum learning scheme can be integrated into any model for any task, without relying on domain- or task-dependent information, e.g. the data samples. In Table 16 from “Appendix C”, we show that combining LeRaC and CBS can boost performance. In a similar fashion, LeRaC can be combined with data-level curriculum strategies for improved performance. We leave this exploration for future work.

Interaction with other learning rate schedulers. Whenever a learning rate scheduler is used for training a model in our experiments, we simply replace the scheduler with LeRaC until epoch k. For example, all the baseline CvT results are based on linear warm-up with cosine annealing, this being the recommended scheduler for CvT (Wu et al., 2021). When we introduce LeRaC, we simply deactivate alternative schedulers between epochs 0 and k. In general, we recommend deactivating other schedulers while LeRaC is active, both for simplicity and to avoid stability issues.
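A minimal sketch of this handover is given below, assuming a stand-in model whose recommended scheduler is cosine annealing; the epoch budget and the value of k are placeholders.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))  # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scheduler = CosineAnnealingLR(optimizer, T_max=200)  # the model's usual scheduler
k = 5  # last LeRaC epoch (placeholder)

for epoch in range(200):
    if epoch <= k:
        # LeRaC regime: set the per-layer learning rates here (see the previous sketch);
        # the alternative scheduler stays inactive during these epochs.
        pass
    # ... run one training epoch over the data loader ...
    if epoch > k:
        scheduler.step()  # resume the standard scheduler once LeRaC ends
```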

Limitations of our work. One limitation is the need to disable other learning rate schedulers while using LeRaC. We already tested this scenario with linear warm-up with cosine annealing, which is removed when using LeRaC, observing consistent performance gains (see Table 3). However, disabling alternative learning rate schedulers might bring performance drops in other cases. Hence, this has to be decided on a case-by-case basis. Another limitation is the possibility of encountering longer training times or poor convergence when the hyperparameters are not properly configured. We recommend hyperparameter tuning on the validation set to avoid this outcome.

6 Conclusion

In this paper, we introduced a novel model-level curriculum learning approach that is based on starting the training process with increasingly lower learning rates per layer, as the layers get closer to the output. We conducted comprehensive experiments on 12 data sets from three domains (image, text and audio), considering multiple neural architectures (CNNs, RNNs and transformers), to compare our novel training regime (LeRaC) with a state-of-the-art regime (CBS (Sinha et al., 2020)), as well as the conventional training regime (based on early stopping and reduce on plateau). The empirical results demonstrate that LeRaC is significantly more consistent than CBS, perhaps being one of the most versatile curriculum learning strategies to date, due to its compatibility with multiple neural models and its usefulness across different domains. Remarkably, all these benefits come for free, i.e. LeRaC does not add any extra training time over the conventional approach.