Adaptive hierarchical hyper-gradient descent

Adaptive learning rate strategies can lead to faster convergence and better performance for deep learning models. There are widely known human-designed adaptive optimizers such as Adam and RMSProp, gradient-based adaptive methods such as hyper-descent and the practical loss-based step-size adaptation method L4, and meta-learning approaches including learning to learn. However, existing studies did not take the hierarchical structures of deep neural networks into account when designing the adaptation strategies. Meanwhile, balancing adaptiveness and convergence remains an open question. In this study, we investigate novel adaptive learning rate strategies at different levels based on the hyper-gradient descent framework and propose a method that adaptively learns the optimizer parameters by combining adaptive information at different levels. In addition, we show the relationship between regularizing over-parameterized learning rates and building combinations of adaptive learning rates at different levels. Moreover, two heuristics are introduced to guarantee the convergence of the proposed optimizers. Experiments on several network architectures, including feed-forward networks, LeNet-5 and ResNet-18/34, show that the proposed multi-level adaptive approach can significantly outperform many baseline adaptive methods in a variety of circumstances.


Introduction
Deep learning has become the most powerful technique in modern artificial intelligence systems, changing many aspects of real life and improving the efficiency of a variety of industrial fields [20]. The successful training of deep neural networks requires carefully designed optimization algorithms: not only do these algorithms determine the model performance after training, they also determine the efficiency and convergence speed of training. With the wide application of large models with complex structures in recent years, training can be expensive and time-consuming in practice, so the selection of optimizers can be of great importance [15,21,55].
The basic optimization algorithm for training deep neural networks is the gradient descent method (GD), including stochastic gradient descent (SGD), mini-batch gradient descent and batch gradient descent [3]. Model parameters are updated according to the first-order gradients of the objective function with respect to the parameters being optimized, with back-propagation used to calculate the gradients [24,29,46]. Traditionally, the learning rate is treated as a global hyper-parameter to be tuned. However, with little or no adaptiveness, training is difficult to converge and is sensitive to the choice of learning rate. Training models with the basic SGD algorithm usually requires a relatively long training time as well as carefully designed learning rate schedules [10,18]. Rule-based adaptive optimizers such as AdaGrad, RMSProp and Adam achieve faster convergence in many scenarios, but their adaptation power is limited by the corresponding pre-designed updating rules [12,26,53].

Thankfully, auto-differentiation provides a technique for updating hyper-parameters with gradient-descent methods [7,13], making it possible to achieve learning rate adaptation beyond manually designed methods. One example is hyper-gradient descent [6], which introduced a global learning rate adaptation framework for gradient-based optimizers such as SGD and Adam. Their method is shown to improve the convergence speed of multiple optimizers, and it demonstrates that using hyper-gradients for learning rate adaptation is a promising technique for improving existing optimizers. However, their study did not further investigate the detailed structures of parameters and the corresponding learning rate adaptation techniques in complex neural network architectures, which limits the potential of auto-differentiation for learning rate adaptation. As is known, deep neural networks are composed of components at different levels, such as blocks, layers and parameters, and it is reasonable to assume that each component of the model favors a specific learning rate in training. Thus, considering the detailed architectures of networks and exploring hyper-gradient descent for structured learning rate adaptation is a topic of interest and importance.
In this study, we propose a novel family of adaptive optimization algorithms based on the framework of hyper-gradient descent. By considering the regular hierarchical structures of deep neural networks, we introduce corresponding hierarchical learning rate structures, which enable flexible and controllable learning rate adaptation. Meanwhile, to make the most of the gradient information from the training process, we apply both multi-level hyper-gradient descent and gradient-based updating of the combination weights of the different levels. The main contributions of our study can be summarized in the following five points:
• We introduce hierarchical learning rate structures for neural networks and apply hyper-gradient descent to obtain adaptive learning rates at different levels.
• We introduce a set of regularization techniques for learning rates to address the balance of global and local adaptation and show the relationship with weighted combinations.
• We propose an algorithm implementing a trainable weighted combination of adaptive learning rates at multiple levels for model parameter updating.
• Two techniques, weighted approximation and clipping, are introduced to guarantee the convergence of the proposed optimization methods in training.
• Experiments demonstrate that the proposed adaptation method can improve the performance of corresponding baseline optimizers in a variety of tasks with statistical significance.
The paper is organised as follows: Sect. 2 is a literature review of related adaptive optimization algorithms in deep learning. Section 3 introduces the main idea of the hyper-gradient descent algorithm. Section 4 gives a detailed explanation of the proposed multi-level adaptation methods as well as a discussion of their convergence properties. Section 5 is the main experimental part, comparing the proposed algorithm with a set of baselines on several benchmark datasets. In Sect. 6 we provide a further discussion of the hyper-parameter settings, the learning behavior of the combination weights, time and space complexity, etc. Section 7 concludes the paper.

Literature review
Naïve gradient descent methods apply fixed learning rates without any adaptation mechanism. However, considering the change of available information during the learning process, SGD with a fixed learning rate can converge slowly and requires a relatively large amount of computing resources for hyper-parameter search. One solution is to introduce a learning rate adaptation rule, where "adaptation" means that the global or local learning rates or effective step sizes can be refined continuously during training in response to changes in the inputs or other parameters. This idea can be traced back to work on gain adaptation for connectionist learning methods [51] and related extensions for non-linear cases [48,59]. In recent years, optimizers with adaptive updating rules were developed in the context of deep learning, while the hyper-parameter learning rates are still fixed in training. These methods include AdaGrad [12], Adadelta [61], RMSProp [53] and Adam [26], in which pre-designed updating rules provide adaptive step sizes for parameter updating. The most widely used one, Adam, computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients, and is shown to be quite effective in speeding up training. Convergence is an essential property for optimization algorithms, and many optimizers aim to address the convergence issue in Adam. For example, it was noticed that Adam does not converge to the optimal solution for some stochastic convex optimization problems, and AMSGrad was introduced as a substitute with a convergence guarantee [43]. Adabound further applies a dynamic bound on gradient methods and builds a gradual transition between the adaptive approach and SGD [36]. RAdam was proposed to rectify the variance of the adaptive learning rate [34].
The Adabelief optimizer [64] can achieve fast convergence, good generalization and training stability by adapting the step size according to the "belief" in the current gradient direction. Other techniques, such as Lookahead, can also achieve variance reduction and stability improvement with negligible extra computational cost [62]. An analysis of adaptive optimizers in nonconvex stochastic optimization problems is provided in [60], which found that increasing mini-batch sizes can circumvent the non-convergence issues. In fact, in recent years, more studies with solid theoretical analysis have provided novel techniques and analysis frameworks for the convergence of adaptive optimizers [1,9,33,54]. Moreover, to address the large memory overhead of adaptive methods, memory-efficient adaptive optimization has been developed, which retains the benefits of standard per-parameter adaptivity [5].
Even though adaptive optimizers with designed updating rules can converge faster than SGD in a wide range of tasks, the gradient information obtained during training is not fully exploited, while more hyper-parameters are introduced. Another idea is to use objective function information and update the learning rates as trainable parameters. These methods were introduced with automatic differentiation, where the hyper-parameters can be optimized with back-propagation [7,38]. As gradient-based hyper-parameter optimization methods, they can be implemented in an online fashion [16]. With the idea of auto-differentiation, learning rates can be updated in real time with the corresponding derivatives of the empirical risk [2], which can be generalized to all types of optimizers for deep neural networks [6]. Another step-size adaptation approach, "L4", is based on a linearized expansion of the objective function and rescales the gradient to make fixed predicted progress on the loss [45]. Meanwhile, layer-wise adaptation methods have also been shown to be effective in accelerating large-batch training, and have been successfully applied to training large models and large datasets [57,58].
Another set of approaches trains an RNN (recurrent neural network) agent to generate the optimal learning rates for the next step given the historical training information, which is known as "learning to learn" [4]. It empirically outperforms hand-designed optimizers in a variety of learning tasks, but another study shows that it may not be effective over long horizons [37]. The generalization ability can be improved by using meta training samples and hierarchical LSTMs (Long Short-Term Memory) [56]. There are also studies focusing on incorporating domain knowledge into LSTM-based optimizers to improve their efficacy and efficiency [17].
Beyond adaptive learning rates, learning rate schedules can also improve the convergence of optimizers, including time-based decay, step decay and exponential decay [32]. The most fundamental and widely applied one is the piecewise step-decay learning rate schedule, which can vastly improve the convergence of SGD and even of adaptive optimizers [34,36]. It can be further improved by introducing a statistical test to determine when to apply the step decay [28,63]. There is also work on warm restarts [35,41], which can improve the anytime performance of SGD when training deep neural networks.
The limitations of existing optimization algorithms lie mainly in the following two aspects: (a) existing gradient- or model-based learning rate adaptation methods, including hyper-gradient descent, L4 and learning to learn, only focus on global adaptation. Meanwhile, some studies [57,58] show that updating rules with layer-wise adaptation are a promising technique for improving convergence speed. Therefore, it is necessary to further extend adaptive optimizers to multi-level cases. (b) In the framework of hyper-gradient descent, no constraints or prior knowledge about the learning rates are introduced, which limits its potential in balancing local and global adaptiveness.
To tackle these limitations, our proposed algorithm is based on hyper-gradient descent but further introduces locally shared adaptive learning rates, such as layer-wise, unit-wise and parameter-wise learning rate adaptation. Meanwhile, we introduce a set of regularization techniques for learning rates in order to balance the global and local adaptations.

Hyper-gradient descent
This section reviews auto-differentiation and hyper-gradient descent with detailed explanations and formulas. The study of hyper-gradient descent in [6] is a re-discovery of the work of [2], in which the gradient with respect to the learning rate is calculated by using the updating rule of the model parameters in the last iteration. The gradient descent updating rule for the model parameters is given by Eq. (1):

\theta_t = \theta_{t-1} - \alpha \nabla f(\theta_{t-1}), \quad (1)

and the gradient of the objective function with respect to the learning rate can then be calculated:

\frac{\partial f(\theta_t)}{\partial \alpha} = \nabla f(\theta_t)^\top \cdot \left(-\nabla f(\theta_{t-1})\right).

The whole learning rate updating rule can be written as:

\alpha_t = \alpha_{t-1} + \beta\, \nabla f(\theta_{t-1})^\top \nabla f(\theta_{t-2}),

where \beta is the hyper learning rate. From a more general perspective, assume that we have an updating rule for the model parameters \theta_t = u(\Theta_{t-1}, \alpha_t). We need to update the value of \alpha_t towards the optimum value \alpha_t^* that minimizes the expected value of the objective in the next iteration. The corresponding gradient can be written as:

\frac{\partial \mathbb{E}[f(\theta_t)]}{\partial \alpha_t} = \mathbb{E}\!\left[\nabla f(\theta_t)^\top \frac{\partial u(\Theta_{t-1}, \alpha_t)}{\partial \alpha_t}\right],

where u(\Theta_{t-1}, \alpha_t) denotes the updating rule of a gradient descent method. The additive updating rule for the learning rate \alpha_t can then be written as:

\alpha_t = \alpha_{t-1} - \beta\, \hat{\nabla} f(\theta_{t-1})^\top \frac{\partial u(\Theta_{t-2}, \alpha_{t-1})}{\partial \alpha_{t-1}},

where \hat{\nabla} f(\theta_t) is the noisy estimator of \nabla f(\theta_t). The multiplicative rule, on the other hand, is given by:

\alpha_t = \alpha_{t-1}\left(1 + \beta\, \frac{\hat{\nabla} f(\theta_{t-1})^\top \hat{\nabla} f(\theta_{t-2})}{\|\hat{\nabla} f(\theta_{t-1})\|\,\|\hat{\nabla} f(\theta_{t-2})\|}\right).

These two types of updating rules can be implemented in any optimizer, including SGD and Adam, with the corresponding \theta_t = u(\Theta_{t-1}, \alpha_t).
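As a minimal sketch of the additive rule above applied to plain SGD (the quadratic objective `f`, the initial `alpha` and the hyper learning rate `beta` are illustrative assumptions, not the paper's experimental setup):

```python
import numpy as np

# Hypothetical quadratic objective f(theta) = 0.5 * ||theta||^2.
def f(theta):
    return 0.5 * np.dot(theta, theta)

def grad_f(theta):
    return theta

theta = np.array([1.0, -2.0])     # model parameters
alpha = 0.01                      # learning rate, itself adapted
beta = 0.001                      # hyper learning rate
prev_grad = np.zeros_like(theta)  # gradient from the previous step

for t in range(100):
    g = grad_f(theta)
    # Additive hyper-gradient update: for SGD, the derivative of f(theta_t)
    # w.r.t. alpha is minus the dot product of consecutive gradients, so
    # alpha moves in the direction of that dot product.
    alpha = alpha + beta * np.dot(g, prev_grad)
    theta = theta - alpha * g
    prev_grad = g
```

When consecutive gradients point in similar directions the learning rate grows, and when they oppose each other it shrinks, which is exactly the intuition behind the dot-product rule.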

Multi-level adaptation methods
Deep neural networks are composed of components at different levels, including blocks, layers, cells, neurons and parameters. It is natural to consider that different parts of the hierarchical architecture favor different learning rates in training. Based on the hyper-gradient descent discussed in Sect. 3, and further motivated by the success of layer-wise adaptation methods in [57,58], we consider the hierarchical structures of neural networks and introduce learning rate adaptation at different levels.

Layer-wise, unit-wise and parameter-wise adaptation
In the hyper-descent paper [6], the learning rate is a scalar. However, to make the most of learning rate adaptation for deep neural networks, we introduce updating rules at different levels of the network architecture, including layer-wise or even parameter-wise updating rules, where the learning rate \alpha_t at each time step is a vector (layer-wise) or even a list of matrices (parameter-wise). For the sake of simplicity, we collect all the learning rates in a vector \alpha_t = (\alpha_1, \ldots, \alpha_N)^\top. Correspondingly, the objective f(\theta) is a function of \theta = (\theta_1, \theta_2, \ldots, \theta_N)^\top, which collects all the model parameters. In this case, the derivative of the objective function f with respect to each learning rate can be written as:

\frac{\partial f(\theta_t)}{\partial \alpha_i} = \frac{\partial f(\theta_t)}{\partial \theta_{i,t}} \cdot \frac{\partial \theta_{i,t}}{\partial \alpha_i}, \quad i = 1, \ldots, N, \quad (2)

where N is the total number of model parameters.
Eq. (2) can be generalized to group-wise updating, where we associate a learning rate with a specific group of parameters, and each parameter group is updated according to its own learning rate. Assume \theta_t = u(\Theta_{t-1}, \alpha) is the updating rule, where \Theta_t = \{\theta_s\}_{s=0}^{t} and \alpha is the learning rate; then the basic gradient descent method for each group i gives

\theta_{i,t} = \theta_{i,t-1} - \alpha_{i,t-1} \nabla_i f(\theta_{t-1}).

Hence, for gradient descent,

\frac{\partial f(\theta_t)}{\partial \alpha_{i,t-1}} = -\nabla_i f(\theta_t)^\top \nabla_i f(\theta_{t-1}).

Here \alpha_{i,t-1} is a scalar with index i at time step t-1, corresponding to the learning rate of the ith group, while the shape of \nabla_i f(\theta) is the same as the shape of \theta_i. We particularly consider three special cases: (1) in layer-wise adaptation, \theta_i is the weight matrix of the ith layer, and \alpha_i is the particular learning rate for this layer; (2) in parameter-wise adaptation, \theta_i corresponds to a single parameter of the model, which can be an element of the weight matrix in a certain layer; (3) we can also introduce unit-wise adaptation, where \theta_i is the weight vector connected to a certain neuron, corresponding to a column or a row of the weight matrix depending on whether it is the input or the output weight vector of the neuron concerned. [6] mentioned the case where the learning rate can be considered as a vector, which corresponds to layer-wise adaptation in this paper.
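The group-wise rule can be sketched as follows (a hypothetical NumPy implementation with a plain-SGD base rule; the function name and `beta` are illustrative assumptions):

```python
import numpy as np

def groupwise_hypergrad_step(params, grads, prev_grads, alphas, beta):
    """One step of group-wise hyper-gradient descent: each group i keeps
    its own scalar learning rate alphas[i], updated with the inner product
    of that group's gradients from two consecutive steps."""
    new_params, new_alphas = [], []
    for p, g, pg, a in zip(params, grads, prev_grads, alphas):
        a = a + beta * float(np.sum(g * pg))  # per-group hyper-gradient step
        new_params.append(p - a * g)          # SGD step with the group's rate
        new_alphas.append(a)
    return new_params, new_alphas
```

Here each entry of `params` could be a layer's weight matrix (layer-wise), a single weight (parameter-wise), or a row/column of a weight matrix (unit-wise), matching the three special cases above.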

Regularization on learning rate
For a model involving a large number of learning rates for different groups of parameters, the update of each learning rate depends on only a small number of examples. Therefore, when the batch size is also not large, the updates will be noisy due to the small random samples, and over-parameterization becomes a concern.
To address this issue, our original idea is to apply regularization on the learning rates, where we introduce the prior knowledge that the low-level adaptive learning rates (e.g. parameter-wise learning rates) are distributed around the high-level ones (e.g. global learning rates). Different regularization terms can be implemented to control the flexibility of learning rate adaptation and achieve the effect of variance reduction. First, for layer-wise adaptation, we can add the following regularization term to the cost function:

L_{lr\_reg\_layer} = \lambda_{layer} \sum_l (\alpha_l - \alpha_g)^2,

where l indexes the layers, \lambda_{layer} is the layer-wise regularization coefficient, and \alpha_l and \alpha_g are the layer-wise and global adaptive learning rates. A large \lambda_{layer} pushes the learning rate of each layer towards the global learning rate. In the extreme case, this leads to very similar learning rates for all layers, and the algorithm reduces to that of [6].
In addition, we can also consider the case where three levels of learning rate adaptation are involved: global, layer-wise and parameter-wise. If we introduce two more regularization terms to control the variation of the parameter-wise learning rates with respect to the layer-wise and global learning rates, the regularization loss can be written as:

L_{lr\_reg} = \lambda_{layer} \sum_l (\alpha_l - \alpha_g)^2 + \lambda_{para\_layer} \sum_l \sum_p (\alpha_{l,p} - \alpha_l)^2 + \lambda_{para\_global} \sum_l \sum_p (\alpha_{l,p} - \alpha_g)^2, \quad (3)

where p indexes the parameters within each layer. The second and third terms push each parameter-wise learning rate towards the layer-wise learning rate and towards the global learning rate, respectively, while \lambda_{para\_layer} and \lambda_{para\_global} are the corresponding regularization coefficients.
With these regularisation terms, the flexibility and variance of the learning rates at different levels can be neatly controlled, and the model can reduce to the base case where a single learning rate is used for the whole model. In addition, one more regularization term can improve stability across time steps, which can also be used in the original hyper-descent algorithm where the learning rate at each time step is a scalar:

L_{lr\_reg\_ts} = \lambda_{ts} (\alpha_t - \alpha_{t-1})^2,

where \lambda_{ts} is the regularization coefficient controlling the difference between the learning rates of the current and the previous step. With this term, the model with learning rate adaptation approaches the model with a fixed learning rate as large regularization coefficients are used. Thus, we can write the objective function of the full model as:

L = L_{model} + L_{model\_reg} + L_{lr\_reg},

where L_{model} and L_{model\_reg} are the loss and the regularization cost of the base model. L_{lr\_reg} can be any of L_{lr\_reg\_layer}, L_{lr\_reg\_unit} and L_{lr\_reg\_para}, depending on the specific requirements of the learning task, while the corresponding regularization coefficients can be optimized with random search over several extra dimensions.
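The three-level regularization loss described above can be sketched directly (a hypothetical helper; `alpha_param[l]` holds the parameter-wise rates of layer l, and the coefficient names are illustrative):

```python
def lr_reg_loss(alpha_param, alpha_layer, alpha_global,
                lam_layer, lam_para_layer, lam_para_global):
    """Three-level learning-rate regularizer: layer rates are pulled
    towards the global rate, and parameter rates are pulled towards both
    their layer rate and the global rate."""
    loss = 0.0
    for a_l, layer_params in zip(alpha_layer, alpha_param):
        loss += lam_layer * (a_l - alpha_global) ** 2
        for a_p in layer_params:
            loss += lam_para_layer * (a_p - a_l) ** 2
            loss += lam_para_global * (a_p - alpha_global) ** 2
    return loss
```

When all rates coincide the penalty vanishes, which is the base case of a single shared learning rate.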

Updating rules for learning rates
Considering these regularisation terms and taking layer-wise adaptation as an example, the gradient of the cost function with respect to a specific learning rate \alpha_l of layer l can be written as:

\frac{\partial L}{\partial \alpha_{l,t-1}} = h_{l,t-1} + 2\lambda_{layer}(\alpha_{l,t} - \alpha_{g,t}), \quad (4)

with the corresponding updating rule by naïve gradient descent:

\alpha_{l,t} = \alpha_{l,t-1} - \beta\left[h_{l,t-1} + 2\lambda_{layer}(\alpha_{l,t} - \alpha_{g,t})\right].

The updating rules for the other types of adaptation can be derived accordingly. Notice that the time step index of the layer-wise regularization term is t rather than t-1, which ensures that we push the layer-wise learning rates towards the corresponding global learning rate of the current step. If we assume h_{l,t-1} = \partial L_{model} / \partial \alpha_{l,t-1} and define \hat{\alpha}_{l,t} = \alpha_{l,t-1} - \beta h_{l,t-1}, then Eq. (4) can be written as:

\alpha_{l,t} = \hat{\alpha}_{l,t} - 2\beta\lambda_{layer}(\alpha_{l,t} - \alpha_{g,t}). \quad (5)

In Eq. (5), both sides include the term \alpha_{l,t}; the natural way to handle this is to solve for the closed form of \alpha_{l,t}, which gives:

\alpha_{l,t} = \frac{\hat{\alpha}_{l,t} + 2\beta\lambda_{layer}\,\alpha_{g,t}}{1 + 2\beta\lambda_{layer}}. \quad (6)

Equation (6) gives a closed-form solution, but it is only applicable in the two-level case. When there are more levels, the components of learning rates at different levels can be interdependent. Meanwhile, there is an extra hyper-parameter \lambda_{layer} to be tuned. To construct a workable updating scheme, we replace \alpha_{l,t} and \alpha_{g,t} in Eq. (5) with approximations, taking the strategy of using their updated versions without considering regularization, i.e.,

\hat{\alpha}_{l,t} = \alpha_{l,t-1} - \beta h_{l,t-1}, \quad \hat{\alpha}_{g,t} = \alpha_{g,t-1} - \beta h_{g,t-1}, \quad (7)

where h_g is the global hyper-gradient shared by all parameters. We define \hat{\alpha}_{l,t} and \hat{\alpha}_{g,t} as the "virtual" layer-wise and global learning rates, where "virtual" means they are calculated from the equations without regularization, and we do not use them directly for model parameter updating. Instead, we only use them as intermediate variables for calculating the real layer-wise learning rates for model training.
This substitution yields Eq. (8):

\alpha_{l,t} \approx (1 - 2\beta\lambda_{layer})\,\hat{\alpha}_{l,t} + 2\beta\lambda_{layer}\,\hat{\alpha}_{g,t}. \quad (8)

Notice that the right-hand side of Eq. (8) is a weighted average of the layer-wise virtual learning rate \hat{\alpha}_{l,t} and the global virtual learning rate \hat{\alpha}_{g,t} at the current time step. Since we hope to push the layer-wise learning rates towards the global one, the parameters should meet the constraint 0 < 2\beta\lambda_{layer} < 1, and thus they can be optimized by hyper-parameter search within a bounded interval. Moreover, gradient-based optimization of these hyper-parameters can also be applied, so both the layer-wise learning rates and the combination proportions of the local and global information can be learned with back-propagation. This can be done in online or mini-batch settings. The advantage is that the learning process may favor taking more account of global information in some periods, and more local information in others, to achieve the best learning performance, which is not taken into account by existing learning rate adaptation approaches. Now consider the difference between the closed-form solution in Eq. (6) and the approximation in Eq. (8). Based on the setting of multi-level adaptation, the global learning rate is updated without regularization, so \hat{\alpha}_{g,t} = \alpha_{g,t}. For the layer-wise learning rates, the difference between the virtual and the true rate is given by \hat{\alpha}_{l,t} - \alpha_{l,t} = 2\beta\lambda_{layer}(\alpha_{l,t} - \alpha_{g,t}), which corresponds to the gradient with respect to the regularization term. Thus, the approximation error can be written as:

\alpha_{l,t}^{(6)} - \alpha_{l,t}^{(8)} = \frac{4\beta^2\lambda_{layer}^2}{1 + 2\beta\lambda_{layer}}\left(\hat{\alpha}_{l,t} - \hat{\alpha}_{g,t}\right), \quad (9)

which is the error of the virtual approximation introduced in Eq. (7). If 4\beta^2\lambda_{layer}^2 \ll 1 or \hat{\alpha}_{g,t}/\hat{\alpha}_{l,t} \to 1, this approximation becomes more accurate.
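A quick numerical sanity check of the relationship between the closed form of Eq. (6), the weighted-average approximation of Eq. (8) and the error term of Eq. (9); the values of `beta`, `lam` and the virtual rates are hypothetical:

```python
# Hypothetical values for beta, lambda_layer and the two virtual rates.
beta, lam = 0.02, 5.0
a_l_hat, a_g_hat = 0.012, 0.008

c = 2 * beta * lam                              # c = 2*beta*lambda_layer
exact = (a_l_hat + c * a_g_hat) / (1 + c)       # closed form, Eq. (6)
approx = (1 - c) * a_l_hat + c * a_g_hat        # weighted average, Eq. (8)
error = c * c / (1 + c) * (a_l_hat - a_g_hat)   # error term, Eq. (9)
```

The gap between the two forms matches the error term exactly, and it shrinks quadratically in 2*beta*lambda_layer, which is why small hyper learning rates keep the virtual approximation accurate.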
Another way of handling Eq. (5) is to use the learning rates of the last step in the regularization term.
Since we have \alpha_{l,t} = \hat{\alpha}_{l,t} - 2\beta\lambda_{layer}(\alpha_{l,t} - \alpha_{g,t}) and \hat{\alpha}_{l,t} = \alpha_{l,t-1} - \beta h_{l,t-1}, using the learning rates of the last step for regularization introduces a higher variation, from the term \beta h_{l,t-1}, with respect to the true learning rates of the current step. Thus, we consider the proposed virtual approximation to work better than the last-step approximation. Similar to the two-level case, for the three-level regularization shown in Eq. (3), for the sake of simple derivation we denote \lambda_2 = \lambda_{layer} and \lambda_3 = \lambda_{para\_layer} for the regularization parameters in Eq. (3). The updating rule can then be written as a weighted combination of \hat{\alpha}_{p,t}, \hat{\alpha}_{l,t} and \hat{\alpha}_{g,t}, where we assume that \hat{\alpha}_{p,t}, \hat{\alpha}_{l,t} and \hat{\alpha}_{g,t} are independent variables. Therefore, in the case of three-level learning rate adaptation, the regularization effect can still be considered as applying a weighted combination of the learning rates at different levels, and this conclusion does not depend on the signs of the virtual updates in Eq. (7). In general, we can organize all the learning rates in a tree structure. For example, in the three-level case above, \alpha_g is the root node, the \{\alpha_l\} are the children nodes at the second level of the tree, and the \{\alpha_{l,p}\} are the children of \alpha_l, as leaf nodes at the third level of the tree. In the general case, we assume there are L levels in the tree. Denote the set of all paths from the root node to the leaf nodes as P, and denote a path by p = \{\alpha_1, \alpha_2, \ldots, \alpha_L\}, where \alpha_1 is the root node and \alpha_L is the leaf node on the path. On this path, denote by ancestors(i) all the ancestor nodes of \alpha_i along the path, i.e., ancestors(i) = \{\alpha_1, \ldots, \alpha_{i-1}\}. We will construct a regularizer that pushes \alpha_i towards each of its ancestors.
Then the regularization can be written as:

L_{lr\_reg} = \sum_{p \in P} \sum_{i=2}^{L} \sum_{\alpha_j \in ancestors(i)} \lambda_{ij} (\alpha_i - \alpha_j)^2. \quad (10)

Under this pair-wise L2 regularization, the updating rule for any leaf-node learning rate \alpha_L is given by the following theorem.

Theorem 1 Under the virtual approximation, the effect of adding pair-wise L2 regularization on the different levels of adaptive learning rates is that the effective learning rate of each leaf node becomes a convex combination of the virtual learning rates along the path from the root to that leaf, i.e. \alpha_L^* = \sum_{j=1}^{L} \gamma_j \hat{\alpha}_j with \sum_{j=1}^{L} \gamma_j = 1.

Remark: Theorem 1 suggests that a similar updating rule can be obtained for the learning rate at any level on the path. This is demonstrated in Algorithm 1 for the three-level case.
Proof Consider the learning rate regularizer in Eq. (10). To apply the hyper-gradient descent method to update the learning rate \alpha_L at level L, we need to work out the derivative of L_{lr\_reg} with respect to \alpha_L. The terms in (10) involving \alpha_L are only the terms (\alpha_L - \alpha_j)^2, where \alpha_j is an ancestor on the path from the root to the leaf node \alpha_L. Hence

\frac{\partial L_{lr\_reg}}{\partial \alpha_L} = \sum_{j} 2\lambda_{Lj} (\alpha_L - \alpha_j).

As there are exactly L-1 ancestors on the path, we can simply use the index j = 1, 2, \ldots, L-1. The corresponding updating rule for \alpha_{L,t} is:

\alpha_{L,t}^{*} = \sum_{j=1}^{L} \gamma_j \hat{\alpha}_{j,t}, \quad where \quad \gamma_j = \frac{2\beta\lambda_{Lj}}{1 + 2\beta\sum_{k=1}^{L-1}\lambda_{Lk}} \;(j < L), \quad \gamma_L = \frac{1}{1 + 2\beta\sum_{k=1}^{L-1}\lambda_{Lk}}.

This form satisfies \alpha_L^* = \sum_{j=1}^{L} \gamma_j \hat{\alpha}_j with \sum_{j=1}^{L} \gamma_j = 1. This completes the proof. ◻ Therefore, by applying a weighted linear combination of the virtual learning rates at different levels as the effective learning rate for parameter updating, the effect of adding regularization on adaptive learning rates in Sect. 4.2 can be approximately achieved. The approximation error can be controlled by fixed parameters. This demonstrates that we can use a more convenient combination form to update the effective learning rates on the leaves of the hierarchical model structure. Moreover, the combination form can be extended to the case of many levels.
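The combination weights in the proof can be computed as in this sketch (a hypothetical helper; `lams[j]` stands for the coefficient between the leaf and its j-th ancestor):

```python
def path_weights(beta, lams):
    """Convex-combination weights for a leaf learning rate whose path has
    len(lams) ancestors: one weight gamma_j per ancestor, plus the leaf's
    own weight as the last entry. The weights sum to one by construction."""
    s = 2 * beta * sum(lams)
    gammas = [2 * beta * lam / (1 + s) for lam in lams]  # ancestors j = 1..L-1
    gammas.append(1 / (1 + s))                           # the leaf node itself
    return gammas
```

For example, a three-level path (global and layer ancestors, parameter leaf) with `beta=0.02` and `lams=[5.0, 3.0]` yields three positive weights that sum to one, most of the mass staying on the leaf's own virtual rate.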

Prospective of learning rate combination
Motivated by the analytical derivation and corresponding discussion in Sect. 4.3, we can consider the combination of adaptive learning rates at different levels as a substitute for regularization on the differences of learning rates. As a simple case, the combination of global and layer-wise adaptive learning rates can be written as \alpha_t = \gamma_1 \hat{\alpha}_{l,t} + \gamma_2 \hat{\alpha}_{g,t}, where \gamma_1 + \gamma_2 = 1 and \gamma_1 \ge 0, \gamma_2 \ge 0. In the general form, assume that we have n levels, which could include the global level, layer level, unit level, parameter level, etc.; then:

\alpha_t = \sum_{i=1}^{n} \gamma_i \hat{\alpha}_{i,t}, \quad \sum_{i=1}^{n} \gamma_i = 1, \quad \gamma_i \ge 0. \quad (11)

In an even more general form, we can implement non-linear models such as neural networks to model the final adaptive learning rate as a function of the learning rates at different levels:

\alpha_t = g(\hat{\alpha}_{1,t}, \ldots, \hat{\alpha}_{n,t};\, w),

where w is the vector of parameters of the non-linear model. In this study, we treat the combination weights \{\gamma_1, \ldots, \gamma_n\} as trainable parameters, as demonstrated in Eq. (11). Figure 1 gives an illustration of the linear combination of three-level hierarchical learning rates.
In fact, we only need these different levels of learning rates to have a hierarchical relationship, which means the selection of component levels is not fixed. For example, in feed-forward neural networks, we can use the parameter level, unit level, layer level and global level. For recurrent neural networks, the corresponding layer level can either be the "layer of gate" within a cell structure such as LSTM and GRU, or the whole cell in a particular RNN layer. By "layer of gate" we mean that the parameters in each gate of a cell structure share the same learning rate. Meanwhile, for convolutional neural networks, we can further introduce a "filter level" to replace the layer level if there is no clear layer structure, where the parameters in each filter share the same learning rate.
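As an illustration of this flexible grouping (purely hypothetical tensor shapes and naming; the leading dimension is taken to index output filters/units), parameter tensors can be mapped to learning-rate groups at a chosen level:

```python
# Hypothetical parameter shapes: a conv layer (out_filters first) and a
# fully connected layer (out_units first).
shapes = {"conv1.weight": (8, 3, 3, 3), "fc1.weight": (10, 72)}

def lr_groups(shapes, level):
    """Map each parameter tensor to its learning-rate group names for a
    given level: 'global' (one shared rate), 'layer' (one rate per tensor),
    or 'filter' (one rate per output filter/unit, i.e. per leading slice)."""
    groups = {}
    for name, shape in shapes.items():
        if level == "global":
            groups[name] = ["global"]
        elif level == "layer":
            groups[name] = [name]
        elif level == "filter":
            groups[name] = [f"{name}/filter{i}" for i in range(shape[0])]
    return groups
```

Any such grouping induces a valid hierarchy as long as each lower-level group is contained in exactly one higher-level group.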
As the real learning rates implemented in model parameter updating are weighted combinations, the corresponding Hessian matrices cannot be directly used for learning rate updating. If we took the gradients of the loss with respect to the combined learning rates and used them to update the learning rate of each parameter, the procedure would reduce to parameter-wise learning rate updating. To address this issue, we first break down the gradient of the combined learning rate into the three levels, use each part to update the learning rate at the corresponding level, and then calculate the combination of the updated learning rates. In particular, h_{p,t}, h_{l,t} and h_{g,t} are calculated from the gradients of the model loss without regularization, as shown in Eq. (12). Algorithm 1 gives the full updating rules for the newly proposed optimizer with three levels, which we denote Combined Adaptive Multi-level Hyper-gradient Descent (CAM-HD).
Notice that in each updating time step of Algorithm 1, we re-normalize the combination weights \gamma_1, \gamma_2 and \gamma_3 to make sure that their sum is always 1, even after updating with stochastic gradient-based methods. An alternative way of doing this is to implement a softmax, which requires an extra set of intermediate variables c_p, c_l and c_g with:

\gamma_i = \frac{\exp(c_i)}{\exp(c_p) + \exp(c_l) + \exp(c_g)}, \quad i \in \{p, l, g\}.

The updating of the \gamma's is then converted to the updating of the c's during training. In addition, the training of the \gamma's can also be extended to multi-level cases, which means we can have different combination weights in different layers.
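The softmax re-parameterization can be sketched as follows (hypothetical variable naming for c_p, c_l, c_g):

```python
import numpy as np

def combo_weights(c):
    """Combination weights from free variables c = [c_p, c_l, c_g] via a
    softmax, so the weights are positive and always sum to one; gradient
    updates are then applied to c rather than to the weights directly."""
    e = np.exp(np.asarray(c, dtype=float) - np.max(c))  # subtract max for stability
    return e / e.sum()
```

Because the softmax output is normalized by construction, gradient steps on the c's keep the constraint satisfied automatically and the explicit re-normalization step is no longer needed.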
In Algorithm 1, we use the general form of gradient-descent-based optimizers [36,44]. For SGD, for the updating rates \beta_p, \beta_l and \beta_g of the learning rates at the different levels, we set:

\beta_p = \beta, \quad \beta_l = \beta / n_l, \quad \beta_g = \beta / n,

where n_l is the number of parameters in layer l and n is the total number of model parameters. Although Algorithm 1 involves many updating steps with intermediate parameters, the time complexity does not increase at large scale, which is further discussed in Sect. 6.4. In fact, the updating of the intermediate variables in the algorithm does not involve the batch-size dimension. Meanwhile, only parameter-wise adaptation requires a noticeable proportion of extra computational cost compared with standard back-propagation, which can be avoided when layer-wise or cell-wise learning rates are applied as the lowest-level adaptation.

Convergence analysis
The proposed CAM-HD is not an independent optimization method; it can be applied within any kind of gradient-based updating rule. Its convergence properties depend highly on the base optimizer to which it is applied. Referring to the discussion on convergence in [6], we introduce \alpha_{p,t} = \kappa(t)\,\alpha^*_{p,t} + (1 - \kappa(t))\,\alpha_\infty, where the function \kappa(t) is selected to satisfy t\,\kappa(t) \to 0 as t \to \infty, and \alpha_\infty is a chosen constant value. We then demonstrate the convergence analysis for the three-level case in the following theorem, where \nabla_p is the gradient of the target function w.r.t. a model parameter with index p, \nabla_l is the average gradient of the target function w.r.t. the parameters in the layer with index l, and \nabla_g is the global average gradient of the target function w.r.t. all model parameters.
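A sketch of this safeguard, with one hypothetical choice of \kappa(t) satisfying t\,\kappa(t) \to 0 (the schedule and names are illustrative assumptions):

```python
def kappa(t):
    """A hypothetical schedule with t * kappa(t) = t / (1 + t)**2 -> 0."""
    return 1.0 / (1.0 + t) ** 2

def safeguarded_rate(alpha_star, alpha_inf, t):
    """Blend the adaptive rate alpha_star towards the constant alpha_inf
    over time, so convergence is inherited from the fixed-rate optimizer."""
    k = kappa(t)
    return k * alpha_star + (1.0 - k) * alpha_inf
```

As t grows the effective rate approaches `alpha_inf`; choosing `alpha_inf` below 1/L then lets the base optimizer's standard convergence argument apply.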

Fig. 1 The diagram of a three-level learning rate combination (global, layer-wise and parameter-wise). The three levels of adaptive learning rates are calculated by global-level, layer-level and parameter-level hyper-gradient descent with different grouping strategies. The final effective learning rate is a weighted combination of the three, while the combination weights are also trainable during back-propagation.

Here \beta is a shared parameter. This setting makes the updating steps of the learning rates at different levels be on the same scale, considering the difference in the number of parameters involved in h_{p,t}, h_{l,t} and h_{g,t}. If we instead averaged over the number of parameters in Eq. (12) at first, this adjustment would not be required. CAM-HD is a higher-level adaptation approach, which can be applied with any gradient-based updating rules and advanced adaptive optimizers. For example, if we apply CAM-HD to the Adam optimizer, we obtain Adam-CAM-HD. Similarly, when we apply CAM-HD to SGDN, we obtain SGDN-CAM-HD. Further, it can be merged with Adabound by adding an element-wise clipping procedure [36]:

\eta_t = \mathrm{Clip}\!\left(\alpha^*_{p,t} / \sqrt{V_t},\; \eta_l(t),\; \eta_u(t)\right),

where \alpha^* is the final step size given by the original CAM-HD, and \eta_l(t) and \eta_u(t) are the lower and upper bounds in Adabound. \eta_t can be applied to replace \alpha^*_{p,t}/\sqrt{V_t} in our algorithm, merging the two methods into the so-called "Adabound-CAM-HD". In the experimental part, we follow the original paper of Adabound and related discussions to set \eta_l(t) = 0.1 - \frac{0.1}{(1-\beta_2)t + 1} and \eta_u(t) = 0.1 + \frac{0.1}{(1-\beta_2)t} for both Adabound and Adabound-CAM-HD [36,47]. As the effective parameter-wise updating rates and the corresponding gradients may change after clipping, the updating rules for the other variables should be adjusted accordingly.

Theorem 2 (Convergence under mild assumptions about f) Suppose that f is convex and L-Lipschitz smooth, with ‖∇_p f(θ)‖ < M_p, ‖∇_l f(θ)‖ < M_l and ‖∇_g f(θ)‖ < M_g for some fixed M_p, M_l, M_g and all θ, where L is the Lipschitz constant for all the gradients. Then θ_t → θ* if t ⋅ κ(t) → 0 as t → ∞, where the θ_t are generated according to (non-stochastic) gradient descent.
The proof of this theorem is given as follows.

Proof
We take the three-level case discussed in Sect. 4 as an example, which includes the global level, the layer level and the parameter level. Suppose that the target function f is convex and L-Lipschitz smooth at all levels, which gives for all θ₁ and θ₂: ‖∇f(θ₁) − ∇f(θ₂)‖ ≤ L‖θ₁ − θ₂‖, and that its gradients with respect to the parameter-wise, layer-wise and global parameter groups satisfy ‖∇_p f(θ)‖ < M_p, ‖∇_l f(θ)‖ < M_l, ‖∇_g f(θ)‖ < M_g for some fixed M_p, M_l, M_g and all θ. Then the effective combined learning rate for each parameter satisfies an upper bound that grows at most linearly in the time step, i.e. O(t) as t → ∞. Here θ_{p,i} refers to the value of the parameter indexed by p at time step i, θ_{l,i} refers to the vector of parameters in the layer with index l at time step i, and θ_{g,i} refers to the whole set of model parameters at time step i. In addition, n_p and n_l are the total number of parameters and the number of layers, and we have used 0 < γ_p, γ_l, γ_g < 1. By introducing α̂_{p,t} = κ(t) α*_{p,t} + (1 − κ(t)) α_∞, where the function κ(t) is selected to satisfy t κ(t) → 0 as t → ∞, we have α̂_{p,t} → α_∞ as t → ∞. If α_∞ < 1/L, then for large enough t we have 1/(L + 1) < α̂_{p,t} < 1/L, and the algorithm converges whenever the corresponding gradient-based optimizer converges for such a learning rate under our assumptions about f. This follows the discussion in [25,50]. ◻ This provides a regret bound of R(T) = O(T) under the stated assumptions and conditions, while a stronger rate can be achieved by assuming a stricter form of κ(t). Notice that when we introduce α̂_{p,t} instead of α*_{p,t} in Algorithm 1, the corresponding gradient ∂L(θ)/∂α*_{p,t−1} is scaled by the factor κ(t) through the chain rule. In general, the main ideas involved in the proofs of convergence of Adabound in [36] and [47] are also applicable to Adabound-CAM-HD. The main procedure is as follows.
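The weighted approximation used in the proof can be sketched in a few lines (a minimal illustration; the helper names and the exponential choice of κ are ours, matching the decay function used later in the experiments):

```python
import math

def kappa(t, r=0.002):
    """Exponential decay kappa(t) = exp(-r*t); for any r > 0 this
    satisfies t * kappa(t) -> 0 as t -> infinity."""
    return math.exp(-r * t)

def approx_rate(alpha_star, t, alpha_inf, r=0.002):
    """Weighted approximation
        alpha_hat = kappa(t) * alpha_star + (1 - kappa(t)) * alpha_inf.
    Early in training alpha_hat follows the adaptive rate alpha_star;
    as t grows it approaches the convergence-safe constant alpha_inf."""
    k = kappa(t, r)
    return k * alpha_star + (1.0 - k) * alpha_inf
```

With α_∞ < 1/L, the schedule eventually places the effective rate in the regime where standard gradient-descent convergence results apply.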
• Let x* = argmin_{x∈F} Σ_{t=1}^T f_t(x), which exists since F is closed and convex, and apply the projection relationship, where η_t is the effective updating rate at step t after clipping and normalizing, and m_t is the momentum at step t. • Apply Lemma 1 in the original paper [36,39] with u₁ = x_{t+1} and u₂ = x* to get the upper bound of the regret. In Adabound-CAM-HD, η_t depends on the format of α*. If α* is a matrix of parameter-wise learning rates for a particular layer, η_t is also a matrix, clipped by a pair of global bounds η_u(t) and η_l(t) at each time step t. Notice that in [36], with the code provided at github.com/Luolc/AdaBound, the clipping is also parameter-wise, because in the clipping function η̂_t = Clip(α/√V_t, η_l(t), η_u(t)) the term α/√V_t generates a matrix with the same shape as the corresponding parameter matrix in each layer, although the step size α is a scalar. Thus, the clipping on the element-wise division in Adabound-CAM-HD achieves the same bounding properties. For example, if the scale is adjusted accordingly, the norm of η_t satisfies √t ‖η_t‖_∞ ≤ R_∞, which makes Lemma 3 in [36] hold. Meanwhile, Lemma 1 and Lemma 2 in the original paper can be applied directly, as η_t, α* and m_t take element-wise values in both contexts. Although the proposed algorithm introduces the hierarchical learning rate structure, the parameter-wise form has already been applied for the model parameters, gradients and momentum.
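The element-wise clipping can be sketched as follows (a toy NumPy illustration; the bound schedules mirror the η_l(t), η_u(t) expressions quoted above with final rate 0.1, and the helper names are ours):

```python
import numpy as np

def bound_fns(t, beta2=0.999, final_lr=0.1):
    """Lower/upper bound schedules in the spirit of Adabound:
    both converge to final_lr as t grows."""
    lower = final_lr * (1.0 - 1.0 / ((1.0 - beta2) * t + 1.0))
    upper = final_lr * (1.0 + 1.0 / ((1.0 - beta2) * t))
    return lower, upper

def clipped_step(alpha_star, v, t, beta2=0.999, final_lr=0.1):
    """Element-wise clip alpha*/sqrt(v) into [eta_l(t), eta_u(t)].
    alpha_star and v are arrays shaped like the parameter matrix,
    so the clipping is parameter-wise, as in the released code."""
    lo, hi = bound_fns(t, beta2, final_lr)
    return np.clip(alpha_star / np.sqrt(v), lo, hi)
```

Since the bounds tighten over time, early steps stay adaptive while late steps approach a fixed rate, which is the mechanism the convergence argument relies on.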
This ensures that Adabound-CAM-HD achieves a high level of adaptiveness as well as a good convergence property. Notice that in the original version of the convergence theorem of Adabound in [36], the assumptions on the lower and upper bounds were given by η_l(t + 1) ≥ η_l(t) > 0 and η_u(t + 1) ≤ η_u(t). However, in [47] it is pointed out that this original assumption alone does not guarantee convergence. Therefore, both weighted approximation and clipping can be applied to guarantee the convergence of optimizers with CAM. This means that CAM can safely achieve both high-level parameter-specific adaptiveness and a good convergence property, which gives it the potential to outperform most existing optimization algorithms. Meanwhile, it is compatible with any gradient-based optimizer and network architecture.

Experiments
We use feed-forward neural network models and different types of convolutional neural networks on multiple benchmark datasets to compare with existing baseline optimizers. For each learning task, the following optimizers are applied: (a) standard baseline optimizers such as Adam and SGD; (b) hyper-gradient descent [6]; (c) L4 stepsize adaptation for standard optimizers [45]; (d) the Adabound optimizer [36]; (e) the RAdam optimizer [34]; and (f) the proposed adaptive combination of different levels of hyper-descent. The implementation of (b) is based on the code provided with the original paper. One NVIDIA Tesla V100 GPU with 16 GB memory and 61 GB RAM, and two Intel Xeon 8-core CPUs with 32 GB RAM, are used. The program is built in Python 3.5.1 and PyTorch 1.0 [49]. For each experiment, we provide both the average curves and standard error bars over ten runs.

Hyper-parameter tuning
To compare the effect of CAM-HD with baseline optimizers, we first tune the hyper-parameters for each learning task by referring to related papers [6,26,36,45] as well as by an independent grid search [8,13]. We mainly consider hyper-parameters including batch size, learning rate and other optimizer parameters for models with different architectures. Other settings in our experiments follow open-source benchmark models. The search space for batch size is {2^n}, n = 3, …, 9, while the search spaces for the learning rate, the hyper-gradient updating rate and the combination weight updating rate (CAM-HD-lr) are {10^−1, 10^−2, …, 10^−4}, {10^−1, 10^−2, …, 10^−10} and {0.1, 0.03, 0.01, 0.003, 0.001, 0.0003, 0.0001}, respectively. The selection criterion is the 5-fold cross-validation loss with early stopping at a patience of 3 [42]. The optimized hyper-parameters for the tasks in this paper are given in Table 1. For training ResNets with SGDN, we apply a step-wise learning rate decay schedule as in [34,36]. Notice that although the hyper-parameters are tuned, this does not mean that model performance is sensitive to each hyper-parameter.
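The grid search over these spaces can be sketched as follows (a plain-Python sketch; the `evaluate` callback standing in for a 5-fold cross-validation run with early stopping is our assumption, not part of the released code):

```python
from itertools import product

# Grids mirroring the search spaces stated in the text.
batch_sizes = [2 ** n for n in range(3, 10)]            # {2^n}, n = 3..9
learning_rates = [10 ** -k for k in range(1, 5)]        # 1e-1 .. 1e-4
cam_hd_lrs = [0.1, 0.03, 0.01, 0.003, 0.001, 0.0003, 0.0001]

def grid_search(evaluate):
    """Return the configuration minimizing the validation criterion.
    `evaluate(batch_size, lr, cam_hd_lr)` is a user-supplied function
    that trains with early stopping and returns the 5-fold CV loss."""
    best, best_loss = None, float("inf")
    for bs, lr, clr in product(batch_sizes, learning_rates, cam_hd_lrs):
        loss = evaluate(bs, lr, clr)
        if loss < best_loss:
            best, best_loss = (bs, lr, clr), loss
    return best
```

The hyper-gradient updating rate grid {10^−1, …, 10^−10} would be added as a fourth axis in the same way.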

Combination ratio and model performances
First, we study the initialization of the combination weights of different-level learning rates in the CAM-HD framework. The simulations are based on image classification tasks on MNIST and CIFAR10 [27,30]. We use the full training sets of MNIST and CIFAR10 for training and the full test sets for validation. One feed-forward neural network with three hidden layers of size [100, 100, 100] and two convolutional network models, LeNet-5 [31] and ResNet-18 [23], are implemented. In each case, two levels of learning rates are considered: global and layer-wise adaptation for the FFNN, and global and filter-wise adaptation for the CNNs. For LeNet-5 and the FFNN, Adam-CAM-HD with fixed and trainable combination weights is implemented, while for ResNet-18, both Adam-CAM-HD and SGDN-CAM-HD with fixed and trainable combination weights are implemented in two independent simulations. We vary the initialized combination weights of the two levels in each case and observe the change in test classification accuracy at epoch 30 for the FFNN and at epoch 10 for LeNet-5 and ResNet-18. We also compare the CAM-HD methods with the baseline Adam and SGDN methods in terms of test accuracy after the same number of training epochs. Other hyper-parameters are optimized as in Sect. 5.1. We conduct 10 runs at each combination ratio and plot the average accuracies with corresponding error bars (standard errors). The result is given in Fig. 2, which leads to the following findings. First, the optimal performance is usually attained neither at the fully global level nor at the fully layer/filter level, but at a weighted combination of the two levels of adaptive learning rates, in both the update and no-update cases. Second, the CAM-HD methods outperform the baseline Adam/SGDN methods for most combination-ratio initializations. Third, updating the combination weights is effective and helps achieve better performance than fixed combination weights. This supports our analysis in Sect. 4.3.
Also, in real training processes, the learning may favor different combination weights at different stages, which requires online adaptation of the combination weights. For Adam-CAM-HD, we apply three-level parameter-layer-global adaptation with the initialization γ₁ = γ₂ = 0.3 and γ₃ = 0.4, and two-level layer-global adaptation with γ₁ = γ₂ = 0.5. No decay function of the learning rates is applied. Figure 3 shows the validation accuracy curves of the different optimizers during 30 epochs of training. Both the two-level and the three-level Adam-CAM-HD significantly outperform the baseline Adam optimizer with optimized hyper-parameters. For Adam-HD, we find that the default hyper-gradient updating rate (β = 10^−7) for Adam used in [6] is not optimal in our experiments; an optimized value of 10^−9 outperforms Adam but is still worse than Adam-CAM-HD with β = 10^−7.

Feed forward neural network for image classification
The test accuracy of each setting and the corresponding standard error of the sample mean in 10 trials are given in Table 2.

LeNet-5 for image classification
The second experiment uses LeNet-5, a classical convolutional neural network that does not involve many building and training tricks [31]. We compare a set of adaptive Adam optimizers, including Adam, Adam-HD, Adam-CAM-HD, Adabound, RAdam and L4, on the image classification tasks of MNIST, CIFAR10 and SVHN [40]. For Adam-CAM-HD, we apply a two-level setting with filter-wise and global learning rate adaptation and initialize γ₁ = 0.2, γ₂ = 0.8. We also apply an exponential decay function κ(t) = exp(−rt), as discussed in Sect. 4.5, with rate r = 0.002 for all three datasets, where t is the number of iterations. For L4, we use the recommended L4 learning rate of 0.15. For Adabound and RAdam, we also apply the recommended hyper-parameters from the original papers. The remaining hyper-parameter settings are optimized as in Sect. 5.1.
As shown in Fig. 4, Adam-CAM-HD again shows an advantage over the other methods in all three sub-experiments, except on MNIST, where L4 could perform better at a later stage. The experiment on SVHN indicates that the recommended hyper-parameters for L4 can fail in some cases, producing unstable accuracy curves. RAdam and Adabound outperform the baseline Adam method on MNIST, while Adam-HD does not show a significant advantage over Adam with the optimized hyper-gradient updating rate that is shared with Adam-CAM-HD.
The corresponding summary of test performance is given in Table 3, in which the test accuracy of Adam-CAM-HD outperforms the other optimizers on both CIFAR10 and SVHN. In particular, it gives significantly better results than Adam and Adam-HD on all three datasets.

ResNet for image classification
In the third experiment, we apply ResNets to the image classification task on CIFAR10 [11,23], following the code provided at github.com/kuangliu/pytorch-cifar, where a ResNet-18 achieves an accuracy of 93.02% with SGD and a cosine annealing learning rate schedule. We compare Adam and Adam-based adaptive optimizers, as well as SGD with Nesterov momentum (SGDN) and the corresponding adaptive optimizers, for training both ResNet-18 and ResNet-34. For the SGDN methods, we apply a learning rate schedule in which the learning rate is initialized to the default value of 0.1 and reduced to 0.01, or to 10% of its current value (for SGDN-CAM-HD), after epoch 150. The momentum is set to 0.9 for all SGDN methods. For Adam-CAM-HD and SGDN-CAM-HD, we apply two-level CAM-HD with the same settings as in the second experiment. We also implement the Adabound-CAM-HD discussed in Sect. 4.4 by sharing the common parameters with Adabound. In addition, we apply an exponential decay function with decay rate r = 0.001 for all CAM-HD methods. The learning curves for validation accuracy, training loss and validation loss of ResNet-18 and ResNet-34 are shown in Fig. 5.
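The step-wise SGDN schedule described above can be sketched as a small helper (a minimal illustration; the function name is ours, and in practice PyTorch's built-in schedulers would be used):

```python
def sgdn_lr(epoch, base_lr=0.1, milestone=150, factor=0.1):
    """Step-wise schedule used for the SGDN methods: the learning rate
    starts at base_lr and is multiplied by `factor` (i.e. reduced to
    10% of its value) once the milestone epoch is reached."""
    return base_lr if epoch < milestone else base_lr * factor
```

For SGDN-CAM-HD this schedule governs only the baseline step size; the hierarchical adaptive rates and combination weights continue to be updated on top of it.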
We can see that the validation accuracy of Adam-CAM-HD reaches about 90% within 40 epochs. The test performance is summarized in Table 4. We can learn that for both ResNet-18 and ResNet-34, the proposed CAM-HD methods (Adam-CAM-HD, Adabound-CAM-HD and SGDN-CAM-HD) improve over the corresponding baseline methods (Adam, Adabound and SGDN) with statistical significance. In particular, Adabound-CAM-HD outperforms both Adam-CAM-HD and Adabound.

Discussion
The experiments on both small and large models demonstrate the advantage of the proposed method over the baseline optimizers in terms of validation and test accuracy. One explanation for the performance improvement is that our method achieves a higher level of adaptation by introducing hierarchical learning rate structures with learnable combination weights, while the parameterization level of the adaptive learning rates is controlled by its intrinsic regularization effect. In addition, both weighted approximation and clipping can be applied to guarantee convergence. In this section we discuss several aspects of our study, including hyper-parameter settings, the learning of combination weights, the number of parameters, and space and time complexity.

Performance and hyper-parameter settings
Experiments show that the performance improvement does not require tuning the hyper-parameters independently when the task or model is similar. For example, the hyper-gradient updating rate for LeNet-5, ResNet-18 and ResNet-34 is set to 1e-8 in all our experiments, regardless of the dataset being learned. Also, the hyper-parameter CAM-HD-lr is shared within each group of models (FFNNs, LeNet-5, ResNets) across all datasets. For the combination ratio, γ₁ = 0.2, γ₂ = 0.8 works for all our experiments with convolutional networks. However, as the loss surface with respect to the combination weights may not be convex for deep learning models, the learning of combination weights may fall into local optima. Therefore, several trials may be needed to find a good initialization of the combination weights, although the learning of combination weights works locally [13]. In general, the selected hyper-parameters are transferable to a similar task and still improve over the corresponding baseline, although the optimal hyper-parameter setting may shift a bit.
The proposed CAM-HD method can also incorporate learning rate schedules in many ways to achieve further improvement. One example is our ResNet experiment on CIFAR10 with SGDN and SGDN-CAM-HD. For more advanced learning rate schedules [19,28], we can apply strategies such as a piecewise adaptive scheme that re-initializes all levels at different stages. Another option is to replace the global-level learning rate with a scheduled learning rate while adapting the combination weights and the other levels continuously.

Learning of combination weights
Figures 6, 7, 8 and 10 give the learning curves of the combination weights with respect to the number of training iterations in each experiment, in which each curve is averaged over 5 trials with error bars. Through these figures, we can compare the updating curves across different models, datasets and CAM-HD optimizers. Figure 6 corresponds to the experiment of the FFNN on MNIST in Sect. 3.3 of the main paper, which is a three-level case. We can see that for different FFNN architectures, the learning behaviors of the γs show different patterns, even when trained on the same dataset. Meanwhile, the standard errors over multiple trials are small relative to the changes of the average combination weight values. Figure 7 shows the learning curves of the γs in the experiments of LeNet-5 for MNIST image classification with SGD, SGDN and Adam, trained on 10% of the original training dataset. In addition, Fig. 8 shows the learning curves of the γs in the experiments of LeNet-5 for CIFAR10 and SVHN image classification with Adam-CAM-HD.
As shown in Fig. 7, the equilibrium values of the combination weights differ among SGD-CAM-HD, SGDN-CAM-HD and Adam-CAM-HD. Although the initialization γ₁ = 0.2, γ₂ = 0.8 and the updating rate of 0.03 are the same for the three optimizers, the values of γ₁ and γ₂ change only by a small proportion when training with Adam-CAM-HD, while the change towards larger filter/layer-wise adaptation is much more significant when SGD-CAM-HD or SGDN-CAM-HD is used. The numerical results show that for SGDN-CAM-HD, the average weight for layer-wise adaptation γ₁ jumps from 0.2 to 0.336 in the first epoch, drops back to 0.324, and then keeps increasing to about 0.388. For Adam-CAM-HD, the average γ₁ moves from 0.20 to 0.211, a change of about 5%. In Fig. 8, both subplots are for LeNet-5 models trained with Adam-CAM-HD, with the exponential decay rate for the weighted approximation set to r = 0.002. For the updating curves in Fig. 8a, trained on CIFAR10, the combination weight for filter-wise adaptation moves from 0.20 to 0.188; for the updating curves in Fig. 8b, trained on SVHN, it moves from 0.20 to 0.195. Further exploration shows that r has an impact on the learning curves of the combination weights. As shown in Fig. 9, a smaller r = 0.001 results in a more significant change of the combination weights during training with Adam-CAM-HD. A similar effect can also be observed in the learning curves of the γs for ResNet-18, given in Fig. 10, where we only take the first 8000 iterations. Again, we find that in training ResNet-18 on CIFAR10, the combination weights of SGD/SGDN-CAM-HD change much faster than those of Adam-CAM-HD.
There are several reasons for this effect. First, in the cases where the γs do not move significantly, we use Adam-CAM-HD, whose main learning rate (1e-3) is only about 1%-6% of the learning rate of SGD or SGDN (1e-1). In Algorithm 1, the updating rate of the γs is proportional to α when the other terms are unchanged. Thus, for the same task, if the same updating rate is applied, the updating scale of the γs for Adam-CAM-HD can be much smaller than that for SGDN-CAM-HD. Second, this does not mean that if we applied a much larger updating rate for Adam-CAM-HD, the combination weights would still not change significantly or the performance would not improve; it simply means that a small updating rate can also achieve good performance due to the quality of the initialization points. Third, it is possible that Adam requires a lower level of combination ratio adaptation than SGD/SGDN for the same network architecture, because Adam itself involves stronger adaptiveness.

Number of parameters and space complexity
The proposed adaptive optimizer aims to update the model parameters efficiently, and the number of final model parameters is not increased by introducing the CAM-HD optimizer. However, during the training process, several extra intermediate variables are introduced. For example, in the three-level case for a feed-forward neural network with n_layer layers, we need to store h_{p,t}, h_{l,t} and h_{g,t}, which have sizes S(h_{p,t}) = Σ_{l=1}^{n_layer−1} (n_l + 1) n_{l+1}, S(h_{l,t}) = n_layer and S(h_{g,t}) = 1, respectively, where n_i is the number of units in the i-th layer. Also, the learning rates α_{p,t}, α_{l,t}, α_{g,t} and α*_{p,t} take sizes S(α_{p,t}) = Σ_{l=1}^{n_layer−1} (n_l + 1) n_{l+1}, S(α_{l,t}) = n_layer, S(α_{g,t}) = 1 and S(α*_{p,t}) = Σ_{l=1}^{n_layer−1} (n_l + 1) n_{l+1}, respectively.
We also need a small set of scalar parameters to store γ₁, γ₂ and γ₃ and other coefficients. Considering that when training the baseline models we already need to store the model parameters, the corresponding gradients, and the intermediate gradients of the chain rule, CAM-HD takes at most twice the space for storing intermediate variables in the worst case. For two-level learning rate adaptation with global and layer-wise learning rates, the extra space introduced by CAM-HD is one to two orders of magnitude smaller than that of the baseline model during training.
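The size counts above can be checked with a short helper (plain Python, our own function name; `layer_widths` lists the unit counts n_1, …, n_{n_layer}):

```python
def extra_hypergradient_sizes(layer_widths):
    """Storage sizes of the three-level hyper-gradients for an FFNN,
    following the counts in the text:
      S(h_p) = sum over l of (n_l + 1) * n_{l+1}  (weights plus biases),
      S(h_l) = n_layer,  S(h_g) = 1."""
    s_hp = sum((layer_widths[i] + 1) * layer_widths[i + 1]
               for i in range(len(layer_widths) - 1))
    return {"h_p": s_hp, "h_l": len(layer_widths), "h_g": 1}
```

The parameter-wise learning rates α_{p,t} and α*_{p,t} each take the same size as h_p, which is where the worst-case doubling of intermediate storage comes from.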

Time complexity
In CAM-HD, we need to calculate the gradient of the loss with respect to the learning rates at each level, which are h_{p,t}, h_{l,t} and h_{g,t} in the three-level case. Since the gradient of each parameter is already known during normal model training, the extra computational cost comes from taking summations and updating the lowest-level learning rates. In general, this cost is linear in the number of differentiable parameters of the original model. Here we discuss the cases of feed-forward networks and convolutional networks.
Recall that for a feed-forward neural network the overall computational complexity of training is T(n) = O(m ⋅ n_iter ⋅ Σ_{l=2}^{n_layer} n_l ⋅ n_{l−1}), where m is the number of training examples, n_iter is the number of training iterations, and n_l is the number of units in the l-th layer. On the other hand, when using three-level CAM-HD where the lowest level is parameter-wise, we need n_layer element-wise products to calculate h_{p,t} for all layers, n_layer matrix element summations to calculate h_{l,t} for all layers, and a list summation to calculate h_{g,t}. In addition, two element-wise summations are needed to calculate α_{p,t} and α*_p. Therefore, the extra computational cost of using CAM-HD is ΔT(n) = O(m_b ⋅ n_iter ⋅ Σ_{l=2}^{n_layer} (n_l ⋅ n_{l−1} + n_l)), where m_b is the number of mini-batches for training. Notice that m/m_b is the batch size, which is usually larger than 100, so this extra cost is more than one order of magnitude smaller than the computational complexity of training the model without learning rate adaptation. In the case where the lowest level is layer-wise, only one element-wise matrix product is needed in each layer to calculate h_{l,t}. For convolutional neural networks, the total time complexity of all convolutional layers is [22] O(m ⋅ n_iter ⋅ Σ_{l=1}^{n_conv_layer} n_{l−1} ⋅ s_l² ⋅ n_l ⋅ m_l²), where l is the index of a convolutional layer and n_conv_layer is the depth (number of convolutional layers); n_l is the number of filters in the l-th layer, n_{l−1} is the number of input channels of the l-th layer, s_l is the spatial size of the filter, and m_l is the spatial size of the output feature map. If we consider convolutional filters as layers, the extra computational cost for CAM-HD in this case is ΔT(n) = O(m_b ⋅ n_iter ⋅ Σ_{l=1}^{n_conv_layer} ((n_{l−1} ⋅ s_l² + 1) ⋅ n_l)), which is still more than one order of magnitude smaller than the cost of the model without learning rate adaptation. Therefore, for large networks, applying CAM-HD will not significantly increase the computational cost from a theoretical perspective.
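These counts can be checked with a back-of-the-envelope sketch for the FFNN case (plain Python with our own helper name; constants are dropped, as in the O(·) expressions, so only the ratio of the two quantities is meaningful):

```python
def ffnn_costs(layer_widths, m, m_b, n_iter):
    """Baseline training cost vs. extra CAM-HD cost for an FFNN,
    per the order expressions in the text:
      baseline ~ m * n_iter * sum_l n_l * n_{l-1}
      extra    ~ m_b * n_iter * sum_l (n_l * n_{l-1} + n_l)."""
    per_pass = sum(layer_widths[l] * layer_widths[l - 1]
                   for l in range(1, len(layer_widths)))
    baseline = m * n_iter * per_pass
    extra = m_b * n_iter * sum(
        layer_widths[l] * layer_widths[l - 1] + layer_widths[l]
        for l in range(1, len(layer_widths)))
    return baseline, extra

# MNIST-like example: batch size m/m_b = 100, so the extra cost is
# more than one order of magnitude below the baseline.
base, extra = ffnn_costs([784, 100, 10], m=60000, m_b=600, n_iter=1)
```

The same comparison for CNNs is even more favorable, since the baseline carries the extra m_l² factor from the output feature maps.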

Conclusion
In this study, we propose a gradient-based learning rate adaptation strategy that introduces hierarchical learning rate structures in deep neural networks. By considering the relationship between regularization and the combination of adaptive learning rates at multiple levels, we further propose a joint algorithm for adaptively learning the combination weight of each level (CAM). It increases the adaptiveness of the hyper-gradient descent method at any single level, while the over-parameterization introduced into the optimizer is controlled by an adaptive regularization effect. In addition, both weighted approximation and clipping can be applied to guarantee convergence. The proposed CAM algorithm is compatible with any gradient-based optimizer, learning rate schedule and network architecture. Experiments on FFNN, LeNet-5 and ResNet-18/34 show that the proposed methods outperform standard Adam/SGDN and other baseline methods with statistical significance.