1 Introduction

As a prominent research direction built on neural networks, deep learning [1] has received considerable attention in recent years. A deep learning model is a multi-layered neural network consisting of an input layer, multiple hidden layers, and an output layer, designed to simulate the multi-level learning process of the human brain. It has demonstrated significant advantages in computer vision, natural language processing, and other fields, and is widely applied in video stabilization, object recognition, image processing, and other areas [2,3,4]. Each layer of a deep neural network contains a large number of parameters, which determine the accuracy of the network’s output and reflect the effectiveness of the model. To obtain excellent deep learning models, these parameters must be optimized and updated by a deep learning optimizer during training.

Because deep neural networks are non-convex and nonlinear, with deep hierarchies of hidden layers and a large number of parameters, the optimization problem in deep learning is quite complicated. An excellent optimizer makes the parameters converge to a target point with a low loss value and improves the speed and accuracy with which the model completes its task. At present, optimizers in deep learning can be mainly categorized into first-order methods and second-order methods [5, 6].

First-order optimization algorithms are based on Gradient Descent (GD), originally a fundamental method in optimization theory that was later extended to deep learning. Stochastic Gradient Descent (SGD) [7] is the most basic method in practical deep learning tasks. In each iteration, one or more samples (fewer than the total number of samples) are selected randomly, and the gradient of the model parameters is computed to update them, aiming to minimize the loss function of the model. As shown in Fig. 1, the basic SGD method has several main drawbacks [8]: (1) the gradients may be highly sensitive to certain directions in the parameter space; (2) it can get stuck in local minima or saddle points where the gradient is zero; (3) it applies the same update step size to every parameter without considering gradient variation information, resulting in poor optimization behavior.

Fig. 1 SGD is sensitive to some directions (left) and can fall into local minima or saddle points (right)

Given these shortcomings of SGD, many researchers have put forward improved optimization algorithms. Stochastic Gradient Descent with Momentum (SGDM) [9] is a popular variant that draws an analogy with momentum in physics and considers the influence of historical gradient directions. If the gradient direction at the current moment is similar to that at previous moments, SGDM accelerates the learning of the parameters, which alleviates some of the problems of SGD [10]. However, SGDM, like SGD, still uses a fixed constant as the learning rate, so all parameters are updated with the same step size regardless of the gradient behavior. In recent years, adaptive learning rate optimizers have gained popularity in deep neural network optimization; they adjust the update step size dynamically according to the requirements of the parameter update throughout the iteration process. Adam [11] is one of the most widely used adaptive learning rate optimizers and can increase or decrease the update step size using gradient information. If the loss surface is regarded as a rugged hillside, Adam behaves like a heavy ball with friction and can find the target value faster. However, Adam tends to converge to sharp local minima with poor generalization performance rather than the desired flat minima, resulting in inferior performance compared to SGD-like optimizers on some tasks [12, 13].

Second-order optimization algorithms, such as Newton’s method and the Conjugate Gradient method [14], require second-order derivatives to find the optimal parameter values. They utilize both gradient information and the trend of gradient changes, to some extent, to avoid local optima problems. However, the complexity of deep neural networks limits the practical application of second-order optimizers due to the high computational cost and memory requirements.

The mainstream optimizers mentioned above are all based on integer-order derivatives to obtain gradient information, which have limitations. Specifically, during the training process, they only use the gradient information of the current point to guide the optimization direction, which is calculated by the integer-order derivatives directly and lacks the global state and contextual information. This often leads to local optima issues.

The fractional-order derivative is a natural extension of the integer-order derivative, in which the order of differentiation is extended from integers to fractions. Fractional-order calculus has achieved excellent performance in various fields, such as data preprocessing, control systems, and time series prediction [15,16,17]. Recently, there has been a trend of applying fractional-order calculus to neural network optimization. Compared with the traditional integer-order derivative, the fractional-order derivative possesses global memory characteristics in both the time and space domains and expands the search area around the target point [18], so that additional information can be obtained to avoid the local optima problems caused by integer-order derivatives. Herrera-Alcántara [19] presented optimizers that introduce a fractional derivative gradient of the objective function into the parameter update rule, together with an implementation for the TensorFlow backend. Yu et al. [20] designed the FracM optimizer based on fractional-order calculus and the SGDM algorithm, using the fractional-order difference of the momentum and gradient to adjust the optimization direction. Zhou et al. [21] proposed the FCGD_G–L algorithm, which uses the Grünwald–Letnikov fractional-order derivative [22] to replace the first-order derivative in SGD and Adam, and adds a disturbance factor to improve the robustness of the algorithm. These are all successful applications of fractional-order calculus to deep learning optimization. However, these algorithms only use historical long-term estimated gradients based on fractional-order calculations to adjust the parameter update step size, ignoring the strong influence of the recent real gradients, which sometimes leads to poor optimization speed and accuracy. This is the focus of our research.

In this paper, we propose a novel adaptive learning rate optimizer called AdaGL, based on the classical Grünwald–Letnikov (G–L) fractional-order derivative. AdaGL guides the parameter update direction and step size using both long-term and short-term gradient information, enabling it to escape local minima and saddle points and converge quickly to a flat target value. Specifically, we start from the classical mathematical definition of the G–L fractional-order derivative and replace the parameter update gradient with the fractional-order approximated gradient to incorporate its long memory and global correlation characteristics. Moreover, a step size control coefficient is designed to increase or decrease the update step size adaptively according to the real-time change of the short-term gradients. Theoretical analysis demonstrates the effectiveness of the proposed algorithm in addressing the problem of getting trapped in local minima and saddle points. We have also carried out extensive experiments on various deep learning tasks, including image classification, node classification, graph classification, image generation, and language modeling. The experimental results show that the proposed AdaGL optimizer converges quickly, with excellent accuracy and good generalization performance.

In summary, the main contributions of our work are as follows:

  • Based on the ability of G–L fractional derivatives to capture the historical global memory characteristics of the objective function, we derive the G–L fractional-order approximated gradient theoretically to replace the gradient of parameter updating in the neural networks, so that the curvature information of the objective function can be fully utilized.

  • Considering the significant influence of recent gradient information during optimization, we introduce a step size control coefficient, which feeds back the short-term gradient change to adjust the parameter update step size. This allows the optimizer to jump out of unexpected sharp minima and saddle points and accelerates the learning process.

  • Combining G–L fractional-order approximated gradient and step size control coefficient, we propose an adaptive learning rate optimizer named AdaGL. It comprehensively utilizes both past long-term and current short-term gradients information to regulate the adaptive learning rate, preventing the optimizer from getting trapped in local minima and saddle points and ensuring rapid and stable convergence to flat optimal points.

  • To evaluate the performance of the proposed AdaGL optimizer, we conduct experiments on a variety of deep learning classic architectures and datasets, comparing it with other popular optimizers. Our experiments include image classification with CNNs architectures (ResNet [23] and DenseNet [24]), node classification and graph classification with Graph Convolutional Networks (GCN) [25], image generation with Wasserstein Generative Adversarial Networks (WGAN) [26], and language modeling with Long Short-Term Memory (LSTM) [27]. AdaGL achieves advanced performance in these tasks, improving the convergence speed and accuracy of the network models.

The rest of this paper is organized as follows: Sect. 2 introduces the related work and preliminary mathematical preparation. Section 3 provides a detailed description of the proposed optimizer. Section 4 presents experimental validations of the proposed algorithm. Section 5 concludes the paper and discusses future work.

2 Preliminaries

In order to find the optimal parameters, the most fundamental method adopted in most deep learning neural networks is SGD. From the n samples in the training set, a mini-batch of \(m\left( m\le n\right) \) independent and identically distributed samples \(\left\{ x^{\left( 1\right) },\dots ,x^{\left( m\right) }\right\} \) is selected randomly, where \(x^{\left( i\right) }\) corresponds to the target \(y^{\left( i\right) }\). In the tth iteration, the objective function L is used to compute the gradient of each network parameter \(\theta \) on the mini-batch, and the per-sample gradients are averaged to obtain the estimated gradient \(g_{t}\):

$$\begin{aligned} g_{t} \leftarrow \frac{1}{m} \bigtriangledown _{\theta } {\textstyle \sum _{i}} L\left( f\left( x^{\left( i\right) };\theta _{t-1}\right) ,y^{\left( i\right) }\right) \end{aligned}$$
(1)

Then, in the negative gradient direction, the parameter values are updated by using the learning rate hyperparameter and the gradient:

$$\begin{aligned} \theta _{t} \leftarrow \theta _{t-1}-\eta g_{t} \end{aligned}$$
(2)

where \(\theta _{t-1}\) and \(\theta _{t}\) are the previous and updated parameter values respectively, and \(\eta \) is a fixed learning rate hyperparameter.
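For concreteness, the SGD update in Eqs. 1–2 can be written as the following minimal NumPy sketch; the function and argument names (and the learning rate default) are illustrative rather than taken from any reference implementation:

```python
import numpy as np

def sgd_step(theta, grad_fn, batch, lr=0.01):
    """One SGD iteration: average the per-sample gradients (Eq. 1),
    then step against the gradient with a fixed learning rate (Eq. 2)."""
    grads = np.stack([grad_fn(theta, x, y) for x, y in batch])
    g_t = grads.mean(axis=0)      # estimated mini-batch gradient g_t
    return theta - lr * g_t       # theta_t <- theta_{t-1} - eta * g_t
```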

SGDM is a widely used extension of SGD that incorporates past gradient information in each dimension to maintain a momentum m, which is defined as the Exponential Moving Average (EMA) of the gradients. The parameters are updated as:

$$\begin{aligned} m_{t} \leftarrow \beta m_{t-1} + g_{t} \nonumber \\ \theta _{t} \leftarrow \theta _{t-1}-\eta m_{t} \end{aligned}$$
(3)

where \(m_{t}\) is the momentum obtained in the tth iteration \(\left( m_{0}=0\right) \), and \(\beta \) is a hyperparameter for controlling the decay rate of the momentum.
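A corresponding sketch of the SGDM update in Eq. 3, under the same illustrative naming assumptions:

```python
def sgdm_step(theta, m, g_t, lr=0.01, beta=0.9):
    """SGDM update (Eq. 3): accumulate past gradients into a momentum
    buffer (m_0 = 0) and step along the negative momentum direction."""
    m = beta * m + g_t
    return theta - lr * m, m
```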

Inspired by fractional calculus, FracM was put forward; it computes the momentum and gradient in SGDM with a fractional-order derivative instead of the traditional first-order derivative. The parameter update is defined as follows:

$$\begin{aligned} _{GL}D_{t}^{\alpha } m_{t}= & {} \beta _{GL}D_{t}^{\alpha } m_{t-1} + _{GL}D_{t}^{\alpha } g_{t} \nonumber \\= & {} \beta \left( m_{t-1}+\alpha _{1}m_{t-3}+\alpha _{2}m_{t-5}+\alpha _{3}m_{t-7} \right) \nonumber \\{} & {} + \left( g_{t}+\alpha _{1}g_{t-2}+\alpha _{2}g_{t-4}+\alpha _{3}g_{t-6} \right) \nonumber \\ m_{t-1}\leftarrow & {} _{GL}D_{t}^{\alpha } m_{t} \nonumber \\ \theta _{t}\leftarrow & {} \theta _{t-1}-\eta m_{t-1} \end{aligned}$$
(4)

where \(_{GL}D_{t}^{\alpha }\) represents a G–L fractional-order operation with fractional order \(\alpha \) in the tth iteration, and \(\alpha _{k}(k=1,2,3)\) are the default coefficients of the G–L fractional-order expansion.
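Eq. 4 can be sketched as follows; note that the numeric values of the default coefficients \(\alpha _{k}\) are not specified here, so the `a` tuple below is a purely illustrative placeholder, and missing history entries are treated as zero:

```python
def fracm_step(theta, m_hist, g_hist, lr=0.01, beta=0.9, a=(0.5, -0.125, 0.0625)):
    """Hedged sketch of the FracM update in Eq. 4. m_hist holds m_{t-1}, m_{t-2}, ...
    and g_hist holds g_t, g_{t-1}, ...; `a` stands in for the default alpha_1..alpha_3."""
    pick = lambda h, i: h[i] if i < len(h) else 0.0
    m_part = pick(m_hist, 0) + a[0]*pick(m_hist, 2) + a[1]*pick(m_hist, 4) + a[2]*pick(m_hist, 6)
    g_part = pick(g_hist, 0) + a[0]*pick(g_hist, 2) + a[1]*pick(g_hist, 4) + a[2]*pick(g_hist, 6)
    m_new = beta * m_part + g_part      # fractional-order momentum of Eq. 4
    m_hist.insert(0, m_new)             # becomes m_{t-1} for the next iteration
    return theta - lr * m_new, m_hist
```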

The optimization algorithms described above use a fixed constant as the learning rate, causing all parameters to be updated with the same step size. In recent years, the adaptive learning rate algorithms have been widely used in deep learning tasks, which assign an adaptive step size to each parameter based on the current state.

Duchi et al. [28] proposed the first popular adaptive learning rate optimizer called AdaGrad. AdaGrad divides the learning rate by the square root of the accumulated sum of squared gradients for each parameter, enabling dynamic adjustment of the learning rate. The parameter update rule is as follows:

$$\begin{aligned}{} & {} G_{t} \leftarrow G_{t-1} + g_{t}^{2} \nonumber \\{} & {} \theta _{t} \leftarrow \theta _{t-1}-\frac{\eta g_{t}}{\sqrt{G_{t}}+\delta } \end{aligned}$$
(5)

where \(\delta \) is a numerical stability constant added in the denominator to prevent division by zero.
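A minimal sketch of the AdaGrad update in Eq. 5 (names and defaults are illustrative):

```python
import numpy as np

def adagrad_step(theta, G, g_t, lr=0.01, delta=1e-8):
    """AdaGrad (Eq. 5): accumulate squared gradients and shrink each
    parameter's effective learning rate accordingly."""
    G = G + g_t ** 2
    return theta - lr * g_t / (np.sqrt(G) + delta), G
```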

However, AdaGrad accumulates the squared gradients continuously, leading to a sharp decline in the adaptive learning rate and hindering the learning process. To address this drawback, RMSProp [29] was proposed as an improvement to AdaGrad. RMSProp uses the EMA to calculate the cumulative squared gradients, focusing only on the gradient information within a recent time window. The decay rate hyperparameter \(\beta \) is used to control the length of the time window. The parameter update rule is defined as:

$$\begin{aligned}{} & {} G_{t} \leftarrow \beta G_{t-1} + \left( 1-\beta \right) g_{t}^{2} \nonumber \\{} & {} \theta _{t} \leftarrow \theta _{t-1}-\frac{\eta g_{t}}{\sqrt{G_{t}}+\delta } \end{aligned}$$
(6)
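The RMSProp step (Eq. 6) differs only in replacing the running sum with an EMA of the squared gradients; again, a minimal sketch with illustrative defaults:

```python
import numpy as np

def rmsprop_step(theta, G, g_t, lr=0.001, beta=0.9, delta=1e-8):
    """RMSProp (Eq. 6): EMA of squared gradients, so old gradients are gradually forgotten."""
    G = beta * G + (1 - beta) * g_t ** 2
    return theta - lr * g_t / (np.sqrt(G) + delta), G
```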

Adam can be regarded as a combination of SGDM and RMSProp, which adaptively adjusts the learning rate based on two vectors known as the first-order moment and the second-order moment. The first and second moments are defined as the EMA of the gradient and of the squared gradient, respectively. However, these moments are biased towards zero, especially during the initial iterations, so a bias correction is applied. For the tth iteration, the parameter update in Adam is defined as follows:

$$\begin{aligned}{} & {} m_{t} \leftarrow \beta _{1} m_{t-1} + \left( 1-\beta _{1} \right) g_{t} \nonumber \\{} & {} v_{t} \leftarrow \beta _{2} v_{t-1} + \left( 1-\beta _{2} \right) g_{t}^{2} \nonumber \\{} & {} \hat{m_{t}}\leftarrow \frac{m_{t}}{1-\beta _{1}^{t}} \nonumber \\{} & {} \hat{v_{t}}\leftarrow \frac{v_{t}}{1-\beta _{2}^{t}} \nonumber \\{} & {} \theta _{t} \leftarrow \theta _{t-1}-\frac{\eta \hat{m_{t}}}{\sqrt{\hat{v_{t}}}+\delta } \end{aligned}$$
(7)

where \(\beta _{1}\) and \(\beta _{2}\) are the decay rate hyperparameters for the first-order moment \(m_{t}\) and the second-order moment \(v_{t}\) respectively, and \(m_{0}=0\), \(v_{0}=0\).
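A minimal sketch of one Adam step (Eq. 7), with illustrative names and the usual default hyperparameters:

```python
import numpy as np

def adam_step(theta, m, v, g_t, t, lr=0.001, beta1=0.9, beta2=0.999, delta=1e-8):
    """Adam (Eq. 7): EMAs of the gradient and squared gradient,
    bias-corrected before the adaptive step (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g_t
    v = beta2 * v + (1 - beta2) * g_t ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + delta), m, v
```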

During the later stages of training, when the gradients decrease significantly, Reddi et al. [30] observed that the adaptive learning rate of Adam can increase, leading to potential divergence of the parameter updates. To address this issue, they proposed AMSGrad. AMSGrad modifies the parameter update by using the maximum of the past second-order moments, applying more friction to the optimization process to prevent overshooting the target value. The parameter update rule is defined as:

$$\begin{aligned} \tilde{v}_{t}^{max} \leftarrow max \left( \tilde{v}_{t-1}^{max}, \hat{v_{t}} \right) \nonumber \\ \theta _{t} \leftarrow \theta _{t-1}-\frac{\eta \hat{m_{t}}}{\sqrt{\tilde{v}_{t}^{max}}+\delta } \end{aligned}$$
(8)
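The AMSGrad correction (Eq. 8) then amounts to one extra line on top of Adam's bias-corrected moments, sketched here with illustrative names:

```python
import numpy as np

def amsgrad_step(theta, m_hat, v_hat, v_max, lr=0.001, delta=1e-8):
    """AMSGrad (Eq. 8): use the running maximum of the second moment,
    so the adaptive step size can only shrink."""
    v_max = np.maximum(v_max, v_hat)
    return theta - lr * m_hat / (np.sqrt(v_max) + delta), v_max
```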

DiffGrad [31] is another improvement upon Adam. Unlike AMSGrad, which relies on long-term gradient information, diffGrad focuses on short-term gradient variations. It introduces a diffGrad Friction Coefficient (DFC) to control the adaptive learning rate. The DFC at the tth iteration, denoted \(\xi _{t}\), and the corresponding parameter update rule are defined as follows:

$$\begin{aligned}{} & {} \xi _{t}=AbsSig\left( \bigtriangleup g_{t}\right) =AbsSig\left( g_{t-1}-g_{t}\right) =\frac{1}{1+e^{-\left| g_{t-1}-g_{t}\right| }} \nonumber \\{} & {} \theta _{t} \leftarrow \theta _{t-1}-\frac{\eta \xi _{t} \hat{m_{t}}}{\sqrt{\hat{v_{t}}}+\delta } \end{aligned}$$
(9)

Nevertheless, the DFC compresses the parameter update step size to 0.5–1 times its original value, which further reduces the momentum and slows convergence.
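The DFC itself (Eq. 9) is a one-line function; this small sketch makes its [0.5, 1) range explicit:

```python
import numpy as np

def dfc(g_prev, g_curr):
    """diffGrad friction coefficient (Eq. 9): a sigmoid of the absolute short-term
    gradient change, which scales the step into the [0.5, 1) range."""
    return 1.0 / (1.0 + np.exp(-np.abs(g_prev - g_curr)))
```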

Based on the definition of G–L fractional-order calculus, Zhou et al. [21] designed two optimizers, FCSGD_G–L and FCAdam_G–L, by combining it with SGD and Adam, respectively. The gradient used by both optimizers with fractional order \(\alpha \) is defined as:

$$\begin{aligned} _{\alpha }^{G}D_{\theta _{t-1}}^{\alpha } L\left( \theta _{t-1}\right) = g_{t} + \sum _{i=1}^{10} c\left[ i-1\right] w_{i} g_{t-i} \end{aligned}$$
(10)

where \(c\left[ \cdot \right] \) is a perturbation coefficient taking the value 0 or 1, and \(w_{i} = \left( 1-\frac{\alpha +1}{i+1} \right) w_{i-1}\) with \(w_{0}=1\).
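A hedged sketch of the fractional-order gradient in Eq. 10, assuming the gradient history \(g_{t}, g_{t-1}, \dots , g_{t-10}\) is kept in a list (missing entries treated as zero) and the 0/1 perturbation coefficients are supplied by the caller:

```python
def fc_gl_gradient(g_hist, alpha, c):
    """Eq. 10: g_hist holds g_t, g_{t-1}, ..., g_{t-10}; c is the 0/1 perturbation
    vector c[0..9]; w_i follows w_i = (1 - (alpha+1)/(i+1)) * w_{i-1}, taking
    w_0 = 1 so that the recursion is non-degenerate."""
    w = [1.0]
    for i in range(1, 11):
        w.append((1 - (alpha + 1) / (i + 1)) * w[-1])
    pick = lambda i: g_hist[i] if i < len(g_hist) else 0.0
    return pick(0) + sum(c[i - 1] * w[i] * pick(i) for i in range(1, 11))
```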

Parameter update of FCSGD_G–L is defined as:

$$\begin{aligned} \theta _{t} \leftarrow \theta _{t-1}- \eta _{\alpha }^{G}D_{\theta _{t-1}}^{\alpha } L\left( \theta _{t-1}\right) \end{aligned}$$
(11)

Parameter update of FCAdam_G–L is defined as:

$$\begin{aligned}{} & {} m_{t} \leftarrow \beta _{1} m_{t-1} + \left( 1-\beta _{1} \right) _{\alpha }^{G}D_{\theta _{t-1}}^{\alpha } L\left( \theta _{t-1}\right) \nonumber \\{} & {} v_{t} \leftarrow \beta _{2} v_{t-1} + \left( 1-\beta _{2} \right) \left[ _{\alpha }^{G}D_{\theta _{t-1}}^{\alpha } L\left( \theta _{t-1}\right) \right] ^{2} \nonumber \\{} & {} \hat{m_{t}}\leftarrow \frac{m_{t}}{1-\beta _{1}^{t}} \nonumber \\{} & {} \hat{v_{t}}\leftarrow \frac{v_{t}}{1-\beta _{2}^{t}} \nonumber \\{} & {} \theta _{t} \leftarrow \theta _{t-1}-\frac{\eta \hat{m_{t}}}{\sqrt{\hat{v_{t}}}+\delta } \end{aligned}$$
(12)

3 Algorithm

The first-order optimization algorithms only utilize the local neighborhood information of the parameters to be updated, which leads to a risk of getting trapped in local minima. The second-order optimization algorithms make better use of the curvature information of the function but are constrained by computational complexity, which limits their widespread adoption. In this section, we introduce the G–L fractional-order derivative method and use it to approximate the gradient, thereby supplementing the long-term state information lacking in the first-order optimizers. We then introduce a step size control coefficient that reflects short-term gradients changes, allowing the parameter update step size to be flexibly adjusted in real-time according to the current short-term state information. Finally, we combine these techniques to create our new adaptive learning rate optimizer AdaGL, which can accelerate convergence and prevent falling into the local optima.

3.1 G–L Fractional-Order Derivative

For a continuous function \(f\left( x\right) \), the definition of the integer-order derivative with order n is as follows:

$$\begin{aligned} f^{\left( n\right) }\left( x\right) = \frac{d^{n}f\left( x\right) }{d x^{n}} = \lim _{h\rightarrow 0}\frac{1}{h^{n}}\sum _{r=0}^{n} \left( -1\right) ^{r} \frac{n\left( n-1\right) \dots \left( n-r+1\right) }{r!} f\left( x-rh\right) \end{aligned}$$
(13)

Fractional-order derivative is a classical concept in mathematics, which can be regarded as a generalized form of integer-order derivative. By extending the derivative order from integer to arbitrary rational number, the definition of fractional-order derivative can be obtained. The Grünwald–Letnikov (G–L) fractional-order derivative of a function \(f\left( x\right) \) with order \(\alpha \) is defined as:

$$\begin{aligned} _{G-L}D_{t}^{\alpha } f\left( x\right) = \lim _{h\rightarrow 0}\frac{1}{h^{\alpha }}\sum _{j=0}^{\frac{t-t_{0}}{h}} \left( -1\right) ^{j} \frac{\Gamma \left( \alpha +1\right) }{\Gamma \left( j+1\right) \Gamma \left( \alpha -j+1\right) } f\left( x-jh\right) \end{aligned}$$
(14)

where h is the step size, t and \(t_{0}\) represent the upper and lower bounds of the interval, respectively, and \(\Gamma (\cdot )\) denotes the Gamma function.

When the step size h is small enough, the limit operation in Eq. 14 can be dropped. In the optimization process of neural networks, the parameter update step is discrete rather than continuous, so we set h to its minimum value, \(h=1\). In this case, the G–L fractional-order derivative can be approximated as follows:

$$\begin{aligned} _{G-L}D_{t}^{\alpha } f\left( x\right)\approx & {} \frac{1}{h^{\alpha }}\sum _{j=0}^{\frac{t-t_{0}}{h}} \left( -1\right) ^{j} \frac{\Gamma \left( \alpha +1\right) }{\Gamma \left( j+1\right) \Gamma \left( \alpha -j+1\right) } f\left( x-jh\right) \nonumber \\= & {} \sum _{j=0}^{t-t_{0}} \left( -1\right) ^{j} \frac{\Gamma \left( \alpha +1\right) }{\Gamma \left( j+1\right) \Gamma \left( \alpha -j+1\right) } f\left( x-j\right) \end{aligned}$$
(15)

The number of terms in the expansion in Eq. 15 grows with the number of iterations. Nevertheless, when performing computations on a computer, it is necessary to truncate Eq. 15 to a finite series expansion. Some studies have shown that when the fractional-order derivative formula is expanded to 10 terms, the effect of the fractional order \(\alpha \) on the fractional-order derivative is minimal [32], and the properties of the fractional-order derivative are well expressed in neural networks. Therefore, we set the number of expansion terms in the G–L fractional-order derivative formula to 10, resulting in:

$$\begin{aligned} _{G-L}D_{t}^{\alpha } f\left( x\right) = \sum _{j=0}^{10} \left( -1\right) ^{j} \frac{\Gamma \left( \alpha +1\right) }{\Gamma \left( j+1\right) \Gamma \left( \alpha -j+1\right) } f\left( x-j\right) \end{aligned}$$
(16)
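Eq. 16 translates directly into code; a minimal sketch (the function names are ours), noting that for a fractional order such as \(\alpha =1.5\) all Gamma arguments stay away from the poles of the Gamma function:

```python
from math import gamma

def gl_coeff(alpha, j):
    """j-th weight in Eq. 16: (-1)^j * Gamma(alpha+1) / (Gamma(j+1) * Gamma(alpha-j+1))."""
    return (-1) ** j * gamma(alpha + 1) / (gamma(j + 1) * gamma(alpha - j + 1))

def gl_derivative(f_history, alpha, n_terms=10):
    """Truncated G-L fractional-order derivative (Eq. 16) from the most recent
    values f(x), f(x-1), ..., f(x-n_terms), newest first."""
    return sum(gl_coeff(alpha, j) * f for j, f in enumerate(f_history[:n_terms + 1]))
```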

3.2 Step Size Control Coefficient

In adaptive learning rate optimization algorithms, the most critical component is how the learning rate is controlled. The goal of deep neural network optimization is to find a flat minimum with a low loss. An adaptive learning rate optimizer should therefore be able to escape local minima and saddle points and stay away from sharp minima. To adjust the adaptive learning rate appropriately and ensure that the parameters are iteratively updated in the desired direction, we introduce a step size control coefficient.

Inspired by common activation functions in deep learning, we experimentally compared several candidates and selected the softsign activation function \(y=x/(1+|x|)\) [33]. We scale and shift it to obtain the step size control coefficient \(C_{t}\) in the tth iteration:

$$\begin{aligned} C_{t} = \frac{0.5\left| \bigtriangleup g_{t}\right| }{1+\left| \bigtriangleup g_{t}\right| } + 0.6 = 1.1 -\frac{1}{2\left( 1+\left| g_{t-1}-g_{t}\right| \right) } \end{aligned}$$
(17)
Fig. 2 Graph of the step size control coefficient

The step size control coefficient uses the short-term gradient behavior to control the learning rate; its graph is shown in Fig. 2. From the formula and the graph, we can see that its value range is [0.6,1.1). This ensures that the adaptive learning rate does not decay too much, avoiding a serious slowdown of the learning process, while still allowing the optimizer to move away from undesirable regions faster. To illustrate this more visually, as shown in Fig. 3, when the instantaneous change of the gradient is small, i.e., \(\left| g_{t-1}-g_{t}\right| \) is small, the algorithm may be close to a flat minimum of the objective. In this case, the parameter update step size adaptively decreases, allowing finer exploration of that region. On the other hand, when the instantaneous change of the gradient is large, i.e., \(\left| g_{t-1}-g_{t}\right| \) is large, the algorithm may have reached a local (or sharp) minimum or a saddle point. In this case, the parameter update step size does not decrease and may even increase slightly, providing the momentum to escape from that region and continue searching for better optima.
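A small numerical sketch of Eq. 17 (the function name is ours) shows this behavior directly: a near-zero gradient change yields a coefficient close to 0.6, while a large change pushes it towards 1.1:

```python
import numpy as np

def step_size_control(g_prev, g_curr):
    """Step size control coefficient (Eq. 17): maps the short-term gradient change
    |g_{t-1} - g_t| into the range [0.6, 1.1) via a scaled and shifted softsign."""
    return 1.1 - 1.0 / (2.0 * (1.0 + np.abs(g_prev - g_curr)))

# e.g. step_size_control(0.10, 0.11) ≈ 0.60 (flat region: finer steps),
#      step_size_control(0.10, 5.0)  ≈ 1.02 (sharp minimum or saddle: keep moving)
```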

Fig. 3 The step size control coefficient dynamically adjusts the parameter update speed according to the short-term gradient change

3.3 AdaGL Optimizer

Algorithm 1 AdaGL Optimizer

When Eq. 16 is applied to the training process of deep neural networks, for the objective optimization function L, the G–L fractional-order approximated gradient of the parameter \(\theta \) with order \(\alpha \) can be written as:

$$\begin{aligned} _{G-L}D_{t}^{\alpha } L\left( \theta _{t-1}\right) = \sum _{j=0}^{10} \left( -1\right) ^{j} \frac{\Gamma \left( \alpha +1\right) }{\Gamma \left( j+1\right) \Gamma \left( \alpha -j+1\right) } g_{t-j} \end{aligned}$$
(18)

We adopt an approach similar to Adam, computing the biased first-order moment \(m_{t}\) and second-order moment \(v_{t}\) from the G–L fractional-order approximated gradient in Eq. 18 instead of the raw gradient. To address the bias towards zero during the early iterations, bias corrections are applied separately, yielding \(\hat{m_{t}}\) and \(\hat{v_{t}}\):

$$\begin{aligned}{} & {} m_{t} \leftarrow \beta _{1} m_{t-1} + \left( 1-\beta _{1} \right) _{G-L}D_{t}^{\alpha } L\left( \theta _{t-1}\right) \nonumber \\{} & {} v_{t} \leftarrow \beta _{2} v_{t-1} + \left( 1-\beta _{2} \right) \left[ _{G-L}D_{t}^{\alpha } L\left( \theta _{t-1}\right) \right] ^{2} \nonumber \\{} & {} \hat{m_{t}}\leftarrow \frac{m_{t}}{1-\beta _{1}^{t}} \nonumber \\{} & {} \hat{v_{t}}\leftarrow \frac{v_{t}}{1-\beta _{2}^{t}} \end{aligned}$$
(19)

Finally, by applying the step size control coefficient calculated in Eq. 17, the parameter update formula of our proposed AdaGL optimizer is obtained:

$$\begin{aligned} \theta _{t} \leftarrow \theta _{t-1}-\frac{\eta C_{t} \hat{m_{t}}}{\sqrt{\hat{v_{t}}}+\delta } \end{aligned}$$
(20)

In Eq. 20, the modified first- and second-order moments carry the long-term memory characteristics of the fractional-order gradients, while the step size control coefficient reflects the short-term variations of the real gradients. The combination of these two components gives the AdaGL optimizer both a global and a local perspective: the long-term gradients maintain the overall direction, and the short-term gradients make fine adjustments in the details, so the parameters find the correct update direction more effectively and descend rapidly.

The pseudocode of the proposed AdaGL is provided in Algorithm 1.
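Since Algorithm 1 is given only as pseudocode here, the following is a hedged NumPy sketch that assembles Eqs. 17–20 into a single optimizer class; the class and attribute names are ours, the defaults mirror the hyperparameters reported in Sect. 4.1, and a non-integer fractional order (e.g. \(\alpha =1.5\)) is assumed so that the Gamma terms are well-defined:

```python
import numpy as np
from math import gamma

class AdaGL:
    """Minimal sketch of the AdaGL update (Eqs. 17-20); not the authors' reference code."""

    def __init__(self, lr=1e-3, alpha=1.5, betas=(0.9, 0.999), delta=1e-8, n_terms=10):
        self.lr, self.alpha, self.delta, self.n_terms = lr, alpha, delta, n_terms
        self.beta1, self.beta2 = betas
        # G-L weights (-1)^j Gamma(alpha+1) / (Gamma(j+1) Gamma(alpha-j+1)), j = 0..n_terms
        self.w = [(-1) ** j * gamma(alpha + 1) / (gamma(j + 1) * gamma(alpha - j + 1))
                  for j in range(n_terms + 1)]
        self.m = self.v = 0.0
        self.g_hist = []      # most recent gradient first: g_t, g_{t-1}, ...
        self.t = 0

    def step(self, theta, g_t):
        self.t += 1
        g_prev = self.g_hist[0] if self.g_hist else g_t
        self.g_hist.insert(0, g_t)
        self.g_hist = self.g_hist[: self.n_terms + 1]
        # Eq. 18: G-L fractional-order approximated gradient over the stored history
        d_gl = sum(w_j * g for w_j, g in zip(self.w, self.g_hist))
        # Eq. 19: biased moments and bias correction, as in Adam
        self.m = self.beta1 * self.m + (1 - self.beta1) * d_gl
        self.v = self.beta2 * self.v + (1 - self.beta2) * d_gl ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        # Eq. 17: step size control coefficient from the short-term gradient change
        c_t = 1.1 - 1.0 / (2.0 * (1.0 + np.abs(g_prev - g_t)))
        # Eq. 20: parameter update
        return theta - self.lr * c_t * m_hat / (np.sqrt(v_hat) + self.delta)
```

In practice, one call to `step` per iteration replaces the Adam update, with the gradient history and moments maintained per parameter tensor.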

4 Experiments

In this section, we perform various deep learning tasks to evaluate and compare the performance of the proposed AdaGL optimizer with other popular optimizers. The experiments include:

  1. (1)

    Image classification (CIFAR10 dataset [34]) with CNNs frameworks of ResNet and DenseNet;

  2. (2)

    Node classification (Cora, Citeseer and Pubmed datasets [35]) and graph classification (MUTAG [36], PTC-MR [37], BZR [38], COX2 [38], PROTEINS [39] and NCI1 [40] datasets) with GCN;

  3. (3)

    Image generation (CIFAR10 dataset) with WGAN;

  4. (4)

    Language modeling (Penn TreeBank dataset [41]) with LSTM.

The settings for each task are described in Table 1.

The compared optimizers include:

  1. (1)

    SGD-like optimizers: SGD, MSVAG [42], FracM and FCSGD_G–L;

  2. (2)

    Adaptive learning rate optimizers: Adam, Yogi [43], AdaBound [44], AdaMod [45], RAdam [46], diffGrad, AdaBelief [47], AdaDerivative [48] and FCAdam_G–L.

Table 1 Network architecture and dataset settings in experimental tasks

4.1 Image Classification with CNNs

First, using the CIFAR10 dataset, we conduct image classification experiments on two popular Convolutional Neural Network architectures, ResNet34 and DenseNet121.

The CIFAR10 dataset consists of 60,000 images, including 50,000 images for training and 10,000 images for validation; all images have size \(32\times 32 \times 3\). The ResNet architecture introduces skip connections between layers, allowing the output of an earlier layer to be added directly to the input of a later layer, so that each block learns the residual between its output and input. This design facilitates the training of deeper CNNs and achieves higher accuracy. The DenseNet architecture follows a similar idea to ResNet but establishes dense connections between all preceding layers and subsequent layers, leading to improved performance compared to ResNet.

The hyperparameters for the experiments are set as follows: the number of epochs is 200, the batch size is 128, and the initial learning rates are 0.1 for SGD, FracM and FCSGD_G–L and 0.001 for the other optimizers. The learning rates of all optimizers are reduced by a factor of 10 at the 150th epoch. For all optimizers, the default optimal parameter settings from the source code are used. The decay rate hyperparameter \(\beta \) for momentum-based optimizers is 0.9, and the moment decay rate hyperparameters \(\beta _{1}\) and \(\beta _{2}\) for adaptive learning rate optimizers are 0.9 and 0.999, respectively. The fractional order \(\alpha \) is 1.5, the numerical stability constant \(\delta \) is 1e\(-\)8, and the weight decay is 5e\(-\)4.

Fig. 4 Train and test accuracy curves of optimizers for ResNet34 & DenseNet121 on CIFAR10

Table 2 Test accuracy values of optimizers for ResNet34 & DenseNet121 on CIFAR10 (in %)

Figure 4 shows the train and test accuracy curves on ResNet34 and DenseNet121, and Table 2 presents the mean and standard deviation (in %) of the test accuracy for each optimization algorithm. The experiments verify the fast and stable convergence of our proposed optimizer. Compared with SGD-like optimizers, our method ultimately achieves better performance than traditional SGD, especially on DenseNet121, although it does not surpass the classification accuracy of the two SGD improvements, FracM and FCSGD_G–L. In addition, compared with other adaptive learning rate optimizers, our optimizer performs best on the test set, achieving approximately 1.04% and 1.13% accuracy improvements over Adam on ResNet34 and DenseNet121, respectively.

Previous studies have indicated that, in image classification tasks on the CIFAR10 dataset, although adaptive learning rate algorithms converge faster than SGD-like algorithms, SGD-like algorithms typically yield better final accuracy results [49, 50]. For classification tasks in Computer Vision, adaptive learning rate optimizers tend to find sharp minima rather than flat minima, leading to poorer generalization performance compared to SGD-like optimizers. Our proposed AdaGL outperforms SGD on the test set (other adaptive learning rate optimizers do not surpass SGD), validating its ability to control step size to some extent for finding flat minima. However, its search capability is still slightly inferior to the latest non-adaptive learning rate optimizers.

4.2 Node Classification and Graph Classification with GCN

Research on the performance of different optimizers on Graph Neural Networks (GNNs) is still limited. In this section, we conduct node classification and graph classification experiments on multiple graph benchmark datasets to evaluate the performance of the current popular optimizers and our proposed optimizer.

Table 3 provides the statistical information of the graph benchmark datasets used in the experiments. For both experiments, we employ a three-layer standard GCN as the model. GCN is one of the most classic architectures in GNNs. It utilizes a first-order Chebyshev polynomial approximation and defines graph convolutional operations by mapping graph signals to the spectral domain. This allows GCN to handle non-Euclidean spatial data that traditional CNNs struggle with.

Table 3 Statistical information of the graph benchmark datasets

First, we consider the node classification task on the Cora, Citeseer and Pubmed datasets. All of them are citation network datasets used for semi-supervised document classification: they are undirected graphs in which nodes represent papers and edges represent citation relationships. In the experiments, we use the standard splits and perform 10 runs; in each class, 20 nodes are used for training, 500 nodes for validation, and 1000 nodes for testing. We train for 200 epochs in each run. The initial learning rates are set to 0.07 for Cora, 0.1 for Citeseer, and 0.12 for Pubmed. The best accuracy of each run is recorded, and the mean and standard deviation (in %) are calculated. Table 4 compares the performance of our optimizer with that of existing optimizers.

Table 4 Node classification on 3-layer GCN: mean ± standard deviation of accuracy over 10 runs (in %)

From the experimental results in Table 4, we can see that our proposed optimizer achieves the best accuracy on both the Cora and Pubmed datasets; although it does not achieve the best performance on the Citeseer dataset, it still performs well compared with the other adaptive learning rate optimization algorithms.

Next, we conduct graph classification experiments to further evaluate the proposed optimization algorithm. The goal of graph classification is to learn the mapping between graphs and their category labels so as to correctly predict the class of unlabeled graphs. We choose six benchmark datasets from bioinformatics: MUTAG, PTC-MR, BZR, COX2, PROTEINS and NCI1. They are all datasets of chemical molecules or compounds, where nodes represent atoms and edges represent chemical bonds, and the task is to judge the type or properties of each compound. In the experiments, we randomly split each dataset into a 90% training set and a 10% test set and conduct 10-fold cross validation. For each fold, we train the model for 300 epochs with the initial learning rate chosen as the best value from {0.007, 0.01, 0.015, 0.022}, and the batch size is set to 64. We record the best accuracy achieved in each fold and calculate the mean and standard deviation over the 10 folds (in %). Table 5 compares the performance of our optimizer with that of existing optimizers.

Table 5 Graph classification on 3-layer GCN: mean ± standard deviation of accuracy over 10 runs (in %)

As can be seen from Table 5, in the graph classification experiments on the bioinformatics datasets, our method achieves the second-best classification accuracy on the MUTAG and NCI1 datasets, while it achieves the best performance on the other four datasets.

4.3 Image Generation with WGAN

We perform image generation tasks on the CIFAR10 dataset using a WGAN model with the original CNN generator. WGAN is a classic generative model that improves upon many issues in the original GAN. The original GAN suffers from training difficulties, heavy reliance on the design of the generator and discriminator, and lack of sample diversity, among other problems. WGAN successfully alleviates these limitations.

In the experiment, we train the model for 100 epochs and generate 64,000 fake images from noise. We calculate the Inception Score (IS) [51] for fake images, as well as the Frechet Inception Distance (FID) [52] between the fake images and the real images. Both IS and FID are widely used metrics for evaluating the performance of generative models. IS (higher is better) measures the diversity and realism of the generated images, while FID (lower is better) reflects both the quality and diversity of the generated images.

For each optimizer, we use the optimal hyperparameter settings from the source code and conduct 5 runs; the results are shown in Table 6. Compared with the other optimizers, our proposed AdaGL optimizer performs significantly better on the WGAN, achieving the lowest FID and the highest IS. Figure 5 shows real images and fake samples generated by the WGAN trained with the proposed AdaGL optimizer.

Table 6 Mean ± standard deviation of FID (Lower is better) and IS (Higher is better) over 5 runs with WGAN on CIFAR10
Fig. 5 Real images (left) and fake samples generated on WGAN trained by AdaGL optimizer (right)

4.4 Language Modeling with LSTM

In the language modeling experiment, we use 1-layer, 2-layer and 3-layer LSTMs on the Penn TreeBank dataset to verify the performance of the proposed optimizer. The Penn TreeBank dataset is a classic English language modeling dataset consisting of about 900,000 words of English text and is commonly used for Natural Language Processing tasks. LSTM is a variant of the Recurrent Neural Network (RNN) that can effectively handle time series data, particularly long sequences, where it outperforms classical RNNs.

Fig. 6 PPL (lower is better) curves with different numbers of LSTM layers on Penn TreeBank

Our experiment runs for 200 epochs with a batch size of 20. The initial learning rates are set to 30 for SGD, MSVAG and FCSGD_G–L, 1 for FracM, and the better of 0.001 and 0.01 for the other optimizers. We use perplexity (PPL) as the evaluation metric. The experimental results are shown in Fig. 6 and Table 7.

Observing the experimental results, the following conclusions can be drawn: Regardless of whether it is a 1-layer, 2-layer, or 3-layer LSTM model, the proposed optimizer demonstrates fast convergence speed and achieves the lowest PPL during both training and testing stages, outperforming other optimizers. Furthermore, when comparing different numbers of layers, differences in the extent of improvement can be observed. Compared to the Adam optimizer, our optimizer reduces perplexity by 4.45% in the 1-layer LSTM and by 5.15% in the 3-layer LSTM, which indicates that our optimizer achieves better performance improvements when handling more complex models and larger-scale language modeling tasks.

5 Conclusion

This article proposes a new adaptive learning rate optimization algorithm called AdaGL. AdaGL is based on the G–L fractional-order derivative method, which approximates the gradient during parameter update and fully utilizes the long-term curvature information of the loss function. At the same time, we introduce a step size control coefficient that controls the parameter update direction and step size based on recent gradients changes, enabling acceleration to escape when encountering unexpected local (or sharp) minima or saddle points, and deceleration to explore when encountering flat target points.

Extensive experimental results on image classification, node classification, graph classification, image generation, and language modeling demonstrate that AdaGL achieves excellent performance compared to traditional optimizers, and can converge quickly and stably with high accuracy across various deep learning tasks.

Table 7 Test PPL (lower is better) with different numbers of LSTM layers on Penn TreeBank

In the future, we will continue to explore the integration of fractional-order calculus methods and deep learning optimizers, as well as the properties of loss surfaces that affect the optimization algorithms’ ability to find target points. We plan to extend the different definitions of fractional-order calculus in mathematics to deep learning optimization methods and further investigate hyperparameters such as fractional order. By comparing and analyzing performances, we can select the most suitable optimization method when facing different neural network architectures.