Adaptive Multi-level Hyper-gradient Descent



Introduction
The basic optimization algorithm for training deep neural networks is the gradient descent (GD) method, including stochastic gradient descent (SGD), mini-batch gradient descent and batch gradient descent. Model parameters are updated according to the first-order gradients of the empirical risk with respect to the parameters being optimized, while backpropagation is used to calculate these gradients (Ruder, 2016). Naïve gradient descent methods apply fixed learning rates without any adaptation mechanism. However, considering how the available information changes during the learning process, SGD with fixed learning rates can be inefficient and waste computing resources on hyper-parameter searching. One solution is to introduce adaptive updating rules while the learning rates remain fixed during training, which leads to methods such as AdaGrad (Duchi et al., 2011), RMSProp (Tieleman and Hinton, 2012), and Adam (Kingma and Ba, 2015). There are also optimizers aiming at addressing the convergence issues of Adam (Reddi et al., 2019; Luo et al., 2018) or at rectifying the variance of the adaptive learning rate (Liu et al., 2019). Other techniques such as lookahead can also achieve variance reduction and stability improvement with negligible extra computational cost (Zhang et al., 2019).
Even though adaptive optimizers with fixed learning rates can converge faster than SGD on a wide range of tasks, the updating rules are designed manually and introduce additional hyper-parameters. Another idea is to use the information of the objective function and update the learning rates as trainable parameters. This set of methods was introduced as automatic differentiation, where the hyper-parameters can be optimized with backpropagation (Maclaurin et al., 2015; Baydin et al., 2018). As gradient-based hyper-parameter optimization methods, they can be implemented as an online approach (Franceschi et al., 2017). With the idea of auto-differentiation, learning rates can be updated in real time with the corresponding derivatives of the empirical risk (Almeida et al., 1998), which can be generalized to all types of optimizers for deep neural networks (Baydin et al., 2017). Another step-size adaptation approach, called "L4", is based on a linearized expansion of the loss function and focuses on minimizing the need for learning rate tuning with strong, reproducible performance across multiple architectures (Rolinek and Martius, 2018). Furthermore, to address the poor generalization performance of adaptive methods, dynamic bounds for gradient methods were introduced to build a gradual transition between adaptive methods and SGD (Luo et al., 2018).
Another set of approaches trains an RNN (recurrent neural network) agent to generate the optimal learning rates for the next step given the historical training information, which is known as "learning to learn" (Andrychowicz et al., 2016). It empirically outperforms hand-designed optimizers in a variety of learning tasks, but another study shows that it may not be effective over long horizons (Lv et al., 2017). The generalization ability can be improved by using meta training samples and hierarchical LSTMs (Long Short-Term Memory) (Wichrowska et al., 2017). Other studies focus on incorporating domain knowledge into LSTM-based optimizers to improve their efficacy and efficiency (Fu et al., 2017).
The limitations of existing algorithms mainly lie in the following two aspects: (a) The proposed hyper-gradient descent only considers global adaptation of the learning rate. Even though the original paper mentions that the approach can be generalized to the case where the learning rate is a vector, it is still necessary to investigate whether different levels of parameterization make a difference in model performance as well as training efficiency. (b) No constraints or prior knowledge on learning rates are introduced in the framework of hyper-gradient descent, which could be essential for resolving the issue of over-parameterization when a large number of independent learning rates need to be optimized.
In this study, we propose an algorithm based on existing work on hyper-gradient descent but extend it to layer-wise, unit-wise and parameter-wise learning rate adaptation. In addition, we introduce, for the first time, a set of regularization techniques for learning rates to address the balance between global and local adaptation, which also helps to resolve the over-parameterization issue that arises when a large number of learning rates are being learned. Although these regularizers introduce extra hyper-parameters to be optimized, the resulting models can perform better in a large range of tasks. The main contributions of our study can be summarized in the following three items:
• We propose an algorithm based on existing work on hyper-gradient descent but extend it to layer-wise, unit-wise and parameter-wise learning rate adaptation.
• We introduce a set of regularization techniques for learning rates for the first time to address the balance of global and local adaptation, which is also helpful in controlling over-parameterization as a large number of learning rates are being learned.
• We propose an algorithm that combines adaptive learning rates at different levels for model parameter updating.
The rest of this paper is organized as follows: Section 2 summarizes the related work on auto-differentiation, especially the hyper-gradient descent (HD) algorithm. Section 3 explains the method developed to extend the existing work. Section 4 shows the results of experiments on different learning tasks with a variety of models. Section 5 discusses the validity of the experimental results, and Section 6 concludes the study.

Related work
This section reviews auto-differentiation and hyper-gradient descent with detailed explanation and mathematical formulas. In the original study of hyper-gradient descent (Baydin et al., 2017), the gradient with respect to the learning rate is calculated by using the updating rule of the model parameters in the last iteration. The gradient descent updating rule for model parameter θ is given by Eq. (1):

θ_t = θ_{t−1} − α ∇_θ f(θ_{t−1}).    (1)

Since ∂θ_{t−1}/∂α = −∇_θ f(θ_{t−2}), the gradient of the objective function with respect to the learning rate can then be calculated as

∂f(θ_{t−1})/∂α = −∇_θ f(θ_{t−1}) · ∇_θ f(θ_{t−2}),

and the whole learning rate updating rule can be written as

α_t = α_{t−1} + β ∇_θ f(θ_{t−1}) · ∇_θ f(θ_{t−2}),

where β is the hyper-gradient updating rate. From a more general perspective, assume that we have an updating rule for the model parameters θ_t = u(Θ_{t−1}, α_t). We need to update the value of α_t towards the optimum value α*_t that minimizes the expected value of the objective in the next iteration. The corresponding gradient can be written as

∂E[f(θ_t)]/∂α_t = ∇_θ f(θ_t) · ∂u(Θ_{t−1}, α_t)/∂α_t,

where u(Θ_{t−1}, α_t) denotes the updating rule of a gradient descent method. The additive updating rule of the learning rate α_t can then be written as

α_t = α_{t−1} − β ∇̂_θ f(θ_{t−1}) · ∂u(Θ_{t−2}, α_{t−1})/∂α_{t−1},

where ∇̂_θ f(θ_t) is a noisy estimator of ∇_θ f(θ_t), while the multiplicative rule rescales α_{t−1} instead of adding to it. These two types of updating rules can be implemented in any optimizer, including SGD and Adam, through the corresponding θ_t = u(Θ_{t−1}, α_t).
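As a concrete illustration, the following is a minimal sketch of the additive SGD-HD step described above, written in NumPy; the function and variable names are ours and the base update is assumed to be plain SGD.

```python
import numpy as np

def sgd_hd_step(theta, alpha, grad_fn, prev_grad, beta=1e-4):
    """One additive SGD-HD step: update the learning rate from the alignment of
    successive gradients, then take an SGD step with the new rate."""
    grad = grad_fn(theta)
    # The hyper-gradient of f(theta_t) w.r.t. alpha is -grad_t . grad_{t-1}, so gradient
    # descent on alpha adds beta * grad_t . grad_{t-1}: alpha grows while gradients align.
    alpha = alpha + beta * np.dot(grad, prev_grad)
    theta = theta - alpha * grad
    return theta, alpha, grad

# Toy usage on f(x) = 0.5 * ||x||^2, whose gradient is the identity map.
theta, alpha, prev_grad = np.ones(3), 0.01, np.ones(3)
for _ in range(100):
    theta, alpha, prev_grad = sgd_hd_step(theta, alpha, lambda x: x, prev_grad)
```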

Multi-level adaptation methods
In this study we propose a combined form of adaptive learning rates, where the final learning rate applied for model parameter updating is a weighted combination of adaptive learning rates at different levels, and the combination weights can also be trained with back-propagation. This gives a similar effect to adding regularization on learning rates with certain baselines. First we introduce learning rate adaptation at different levels.

Layer-wise, unit-wise and parameter-wise adaptation
In the hyper-gradient descent paper (Baydin et al., 2017), the learning rate is a scalar. However, to make the most of learning rate adaptation, in this study we introduce layer-wise or even parameter-wise updating rules, where the learning rate α_t at each time step is a vector (layer-wise) or even a list of matrices (parameter-wise). For the sake of simplicity, we collect all the learning rates in a vector α. Correspondingly, the objective f(θ) is a function of θ = (θ_1, θ_2, ..., θ_N)^T, collecting all the model parameters. In this case, the derivative of the objective function f with respect to each learning rate can be written as

∂f(θ)/∂α_i = Σ_{n=1}^{N} (∂f(θ)/∂θ_n)(∂θ_n/∂α_i),    (7)

where N is the total number of model parameters. Eq. (7) can be generalized to group-wise updating, where we associate a learning rate with a particular group of parameters, and each parameter group is updated according to its own learning rate. Assume θ_t = u(Θ_{t−1}, α) is the updating rule, where Θ_t = {θ_s}_{s=0}^{t} and α is the learning rate; then the basic gradient descent method for each group i gives θ_{i,t} = θ_{i,t−1} − α_{i,t−1} ∇_{θ_i} f(θ_{t−1}). Here α_{i,t−1} is a scalar with index i at time step t − 1, corresponding to the learning rate of the i-th group, while the shape of ∇_{θ_i} f(θ) is the same as the shape of θ_i.
We particularly consider three special cases: (1) In layer-wise adaptation, θ_i is the weight matrix of the i-th layer, and α_i is the learning rate particular to this layer.
(2) In parameter-wise adaptation, θ_i corresponds to a single parameter of the model, which can be an element of the weight matrix in a certain layer.
(3) We can also introduce unit-wise adaptation, where θ_i is the weight vector connected to a certain neuron, corresponding to a column or a row of the weight matrix depending on whether it is the input or the output weight vector of the neuron concerned. Baydin et al. (2017) mention the case where the learning rate can be considered as a vector, which corresponds to layer-wise adaptation in this paper.
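To illustrate the group-wise case, the sketch below computes one hyper-gradient signal per parameter group from the current and previous gradients, assuming a plain SGD base rule and PyTorch tensors (one per layer, unit, or parameter); the names are ours.

```python
import torch

def groupwise_hypergradients(grads, prev_grads):
    """Hyper-gradient signal h_i = <grad_i(t), grad_i(t-1)> for each parameter group.

    grads, prev_grads : lists of tensors, one per group (a layer's weight matrix,
                        a unit's weight vector, or a single parameter).
    With an SGD base rule, each group's learning rate is then updated as
    alpha_i <- alpha_i + beta * h_i, and summing h_i over all groups recovers
    the single global hyper-gradient of the scalar-learning-rate case.
    """
    return [(g * pg).sum() for g, pg in zip(grads, prev_grads)]
```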

Regularization on learning rate
For a model involving a large number of learning rates for different groups of parameters, the update of each learning rate only depends on the average over a small number of examples. Therefore, when the batch size is not large either, over-parameterization becomes a concern.
The idea in this study is to introduce regularization on learning rates, which can be implemented to control the flexibility of learning rate adaptation. First, for layer-wise adaptation, we can add the following regularization term to the cost function:

L^{lr}_{reg,layer} = λ_layer Σ_l (α_l − α_g)²,

where l indexes the layers, λ_layer is the layer-wise regularization coefficient, and α_l and α_g are the layer-wise and global adaptive learning rates. A large λ_layer pushes the learning rate of each layer towards the average learning rate across all layers. In the extreme case, this leads to very similar learning rates for all layers, and the algorithm reduces to that in (Baydin et al., 2017).
In addition, we can also consider the case where three levels of learning rate adaptation are involved, namely global, layer-wise and parameter-wise adaptation. If we introduce two more regularization terms to control the variation of the parameter-wise learning rates with respect to the layer-wise and global learning rates, the regularization loss can be written as

L^{lr}_{reg,para} = λ_layer Σ_l (α_l − α_g)² + λ_para_layer Σ_l Σ_p (α_{l,p} − α_l)² + λ_para_global Σ_l Σ_p (α_{l,p} − α_g)²,    (10)

where p indexes the parameters within each layer. The second and third terms push each parameter-wise learning rate towards the layer-wise learning rate and towards the global learning rate, respectively, while λ_para_layer and λ_para_global are the corresponding regularization coefficients.
With these regularization terms, the flexibility and variance of the learning rates at different levels can be neatly controlled, while the model can reduce to the base case where a single learning rate is used for the whole model. In addition, one more regularization term can be added to improve stability across time steps, which can also be used in the original hyper-gradient descent algorithm where the learning rate at each time step is a scalar:

L^{lr}_{reg,ts} = λ_ts (α_t − α_{t−1})²,

where λ_ts is the regularization coefficient controlling the difference between the learning rates of the current and the previous step. With this term, the model with learning rate adaptation approaches the model with a fixed learning rate as large regularization coefficients are used. Thus, we can write the loss function of the full model as

L = L_model + L^{model}_{reg} + L^{lr}_{reg},

where L_model and L^{model}_{reg} are the loss and regularization cost of the base model. L^{lr}_{reg} can be any of L^{lr}_{reg,layer}, L^{lr}_{reg,unit} and L^{lr}_{reg,para} depending on the specific requirements of the learning task, while the corresponding regularization coefficients can be optimized with random search over a few extra dimensions.
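As one possible reading of these regularizers, the sketch below evaluates the three-level penalty together with the time-step term; the data layout (one tensor of parameter-wise rates per layer) and the coefficient values are illustrative assumptions.

```python
import torch

def lr_regularization(alpha_param, alpha_layer, alpha_global, alpha_global_prev,
                      lam_layer=0.1, lam_para_layer=0.1,
                      lam_para_global=0.1, lam_ts=0.0):
    """Regularization cost on hierarchical learning rates (coefficients illustrative).

    alpha_param       : dict {layer: tensor of parameter-wise rates alpha_{l,p}}
    alpha_layer       : dict {layer: scalar tensor, layer-wise rate alpha_l}
    alpha_global      : scalar tensor, global rate alpha_g
    alpha_global_prev : global rate from the previous step (for the time-step term)
    """
    reg = lam_ts * (alpha_global - alpha_global_prev) ** 2            # stability across steps
    for l, a_l in alpha_layer.items():
        reg = reg + lam_layer * (a_l - alpha_global) ** 2             # layer -> global
        reg = reg + lam_para_layer * ((alpha_param[l] - a_l) ** 2).sum()          # param -> layer
        reg = reg + lam_para_global * ((alpha_param[l] - alpha_global) ** 2).sum()  # param -> global
    return reg
```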

Updating rules for learning rates
Considering these regularization terms and taking layer-wise adaptation as an example, the gradient of the cost function with respect to a specific learning rate α_l in layer l can be written as

∂L/∂α_{l,t−1} = ∂L_model/∂α_{l,t−1} + 2λ_layer (α_{l,t} − α_{g,t}),

with the corresponding updating rule by naïve gradient descent:

α_{l,t} = α_{l,t−1} − β ∂L/∂α_{l,t−1}.    (14)

The updating rules for other types of adaptation can be derived accordingly. Notice that the time step index of the layer-wise regularization term is t rather than t − 1, which ensures that we push the layer-wise learning rates towards the corresponding global learning rate of the current step. If we write h_{l,t−1} = −∂L_model/∂α_{l,t−1} for the unregularized hyper-gradient signal, then Eq. (14) can be written as

α_{l,t} = α_{l,t−1} + β h_{l,t−1} − 2βλ_layer (α_{l,t} − α_{g,t}).    (16)

In Eq. (16), both sides include the term α_{l,t}; the natural way to handle this is to solve for the closed form of α_{l,t}, which gives

α_{l,t} = (α_{l,t−1} + β h_{l,t−1} + 2βλ_layer α_{g,t}) / (1 + 2βλ_layer).

In this formula, we still need α_{g,t}, the global average learning rate at the current step. It becomes even harder to calculate when there are multiple levels of learning rates, since the regularization still depends on their values at the current step. A cleaner and computationally more efficient way of handling Eq. (16) is to introduce an approximation that removes α_{l,t} from the right-hand side. If we do not consider the effect of the regularization terms, the updating rules for the layer-wise and global learning rates can be written as

α̃_{l,t} = α_{l,t−1} + β h_{l,t−1},  α̃_{g,t} = α_{g,t−1} + β h_{g,t−1},    (18)

where h_{g,t−1} is the global h over all parameters. We define α̃_{l,t} and α̃_{g,t} as the "virtual" layer-wise and global learning rates, where "virtual" means they are calculated from the equations without regularization and are not used directly for model parameter updating. Instead, we only use them as intermediate variables for calculating the real layer-wise learning rate for model training. Replacing α_{l,t} and α_{g,t} in the regularization term of Eq. (16) by these virtual learning rates gives the approximate update

α_{l,t} ≈ (1 − 2βλ_layer) α̃_{l,t} + 2βλ_layer α̃_{g,t}.    (19)
Notice that in Eq. (19), the right-hand side is a weighted average of the virtual layer-wise learning rate α̃_{l,t} and the virtual global learning rate α̃_{g,t} at the current time step. Since we hope to push the layer-wise learning rates towards the global one, the parameters should satisfy the constraint 0 < 2βλ_layer < 1, so they can be optimized by hyper-parameter searching within a bounded interval. Moreover, gradient-based optimization of these hyper-parameters can also be applied. Hence both the layer-wise learning rates and the combination proportion of local and global information can be learned with back-propagation, in online or mini-batch settings. The advantage is that the learning process may favor more global information in some periods and more local information in others to achieve the best learning performance, which is not taken into account by existing learning rate adaptation approaches.
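A minimal sketch of this weighted-average form of the regularized layer-wise learning rate, assuming the constraint 0 < 2βλ_layer < 1 holds:

```python
def regularized_layer_lr(alpha_layer_virtual, alpha_global_virtual, beta, lam_layer):
    """Weighted average implied by the virtual approximation:
    alpha_{l,t} ~= (1 - 2*beta*lam_layer) * virtual_layer_rate
                   + 2*beta*lam_layer * virtual_global_rate,
    a convex combination when 0 < 2*beta*lam_layer < 1."""
    w = 2.0 * beta * lam_layer
    assert 0.0 < w < 1.0, "2*beta*lam_layer should lie in (0, 1)"
    return (1.0 - w) * alpha_layer_virtual + w * alpha_global_virtual
```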
Now consider the difference between Eq. (16) and Eq. (19). Based on the setting of multi-level adaptation, the global learning rate is updated without regularization, so α̃_{g,t} = α_{g,t}. For the layer-wise learning rates, the difference is given by α̃_{l,t} − α_{l,t} = 2βλ_layer (α_{l,t} − α_{g,t}), which corresponds to the gradient of the regularization term. Thus, the gap between the exact solution of Eq. (16) and the approximation of Eq. (19) can be written as

4β²λ²_layer (α̃_{l,t} − α̃_{g,t}) / (1 + 2βλ_layer),

which is the error of the virtual approximation introduced in Eq. (18). If 4β²λ²_layer ≪ 1 or α̃_{g,t}/α̃_{l,t} → 1, this approximation becomes more accurate.
Another way of handling Eq. (16) is to use the learning rates from the last step in the regularization term. Since α_{l,t} = α̃_{l,t} − 2βλ_layer (α_{l,t} − α_{g,t}) and α̃_{l,t} = α_{l,t−1} + β h_{l,t−1}, using the learning rates of the last step in the regularization introduces an extra deviation of order β h_{l,t−1} from the true learning rates of the current step. We therefore consider the proposed virtual approximation to work better than the last-step approximation.
Similar to the two-level case, the same derivation applies to the three-level regularization shown in Eq. (10). For the sake of a simple derivation, we write λ_2 = λ_layer and λ_3 = λ_para_layer for the regularization parameters in Eq. (10). Under the virtual approximation, where α̃_{p,t}, α̃_{l,t} and α̃_{g,t} are treated as independent variables, the updating rule again takes the form of a weighted combination

α_{p,t} ≈ γ_p α̃_{p,t} + γ_l α̃_{l,t} + γ_g α̃_{g,t},  with γ_p + γ_l + γ_g = 1,

where the weights γ_p, γ_l, γ_g are determined by β, λ_2 and λ_3. Therefore, in the case of three-level learning rate adaptation, the regularization effect can still be viewed as applying a weighted combination of the learning rates at different levels. This conclusion is invariant to the signs involved in Eq. (18).
In general, we can organize all the learning rates in a tree structure. For example, in the three-level case above, α_g is the root node, {α_l} are its children at level 2 of the tree, and {α_{lp}} are the children of α_l, forming the leaf nodes at level 3. In the general case, we assume there are L levels in the tree. Denote the set of all paths from the root node to each leaf node by P, and denote a path by p = {α_1, α_2, ..., α_L}, where α_1 is the root node and α_L is the leaf node on the path. On this path, denote by ancestors(i) all the ancestor nodes of α_i along the path, i.e., ancestors(i) = {α_1, ..., α_{i−1}}. We construct a regularizer that pushes each α_i towards each of its ancestors. The regularization can then be written as

L^{lr}_{reg} = Σ_{p∈P} Σ_{i=2}^{L} Σ_{α_j ∈ ancestors(i)} λ_{ij} (α_i − α_j)².    (27)

Under this pair-wise L2 regularization, the updating rule for any leaf-node learning rate α_L is given by the following theorem.

Theorem 1. Under the virtual approximation, the effect of adding pair-wise L2 regularization on the different levels of adaptive learning rates is equal to performing a weighted linear combination of the virtual learning rates at the different levels, α* = Σ_{i=1}^{n} γ_i α̃_i with Σ_{i=1}^{n} γ_i = 1, where each component α̃_i is calculated assuming there is no regularization.
Remarks: Theorem 1 suggests that a similar updating rule can be obtained for the learning rate at any level on the path. This is demonstrated in Algorithm 1 for the three-level case.
Proof. Consider the learning rate regularizer in Eq. (27). To apply the hyper-gradient descent method to update the learning rate α_L at level L, we need to work out the derivative of L^{lr}_{reg} with respect to α_L. The terms in (27) involving α_L are only the terms (α_L − α_j)², where α_j is an ancestor on the path from the root to the leaf node α_L. Hence

∂L^{lr}_{reg}/∂α_L = Σ_j 2λ_{Lj} (α_L − α_j).

As there are exactly L − 1 ancestors on the path, we can simply use the index j = 1, 2, ..., L − 1. Following the same steps as in the two-level case, the corresponding updating rule for α_{L,t} under the virtual approximation is

α*_{L,t} = (1 − 2β Σ_{j=1}^{L−1} λ_{Lj}) α̃_{L,t} + Σ_{j=1}^{L−1} 2βλ_{Lj} α̃_{j,t},

so that γ_L = 1 − 2β Σ_{j=1}^{L−1} λ_{Lj} and γ_j = 2βλ_{Lj} for j = 1, ..., L − 1. This form satisfies α*_L = Σ_{j=1}^{L} γ_j α̃_j with Σ_{j=1}^{L} γ_j = 1, which completes the proof.
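A direct sketch of this pair-wise regularizer over root-to-leaf paths, assuming for readability a single shared coefficient λ for all pairs:

```python
def tree_lr_regularizer(paths, lam=0.1):
    """Pair-wise L2 regularizer over a learning-rate tree.

    paths : list of root-to-leaf paths, each a list [alpha_1, ..., alpha_L] of
            learning rates from the global root down to one leaf.
    Every node on a path is pulled towards each of its ancestors on that path.
    """
    reg = 0.0
    for path in paths:
        for i in range(1, len(path)):
            for j in range(i):                       # ancestors of node i on this path
                reg += lam * (path[i] - path[j]) ** 2
    return reg
```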

Prospective of learning rate combination
Motivated by the analytical derivation in Section 3.3, we can consider the combination of adaptive learning rates at different levels as a substitute for regularization on the differences between learning rates. In a simple case, the combination of global and layer-wise adaptive learning rates can be written as

α* = γ_1 α_l + γ_2 α_g,

where γ_1 + γ_2 = 1 and γ_1 ≥ 0, γ_2 ≥ 0. In the general case, assuming we have n levels, which may include the global level, layer level, unit level and parameter level, etc., we have

α* = Σ_{i=1}^{n} γ_i α_i,  with Σ_{i=1}^{n} γ_i = 1 and γ_i ≥ 0.    (32)

In an even more general form, we can implement non-linear models such as neural networks to model the final adaptive learning rate as a function of the learning rates at the different levels,

α* = g_θ(α_1, α_2, ..., α_n),
where θ is the vector of parameters of the non-linear model. In this study, we treat the combination weights {γ_1, ..., γ_n} as trainable parameters, as demonstrated in Eq. (32). Figure 1 gives an illustration of the linear combination of three-level hierarchical learning rates. In fact, we only require that the different levels of learning rates form a hierarchical structure. In feed-forward neural networks, we can use the parameter level, unit level, layer level and global level. For recurrent neural networks, the corresponding layer level can either be the "layer of gate" within a cell structure such as LSTM or GRU, or the whole cell in a particular RNN layer; by "layer of gate" we mean that the parameters in each gate of a cell structure share the same learning rate. Meanwhile, for convolutional neural networks, we can further introduce a "filter level" to replace the layer level if there is no clear layer structure, where the parameters in each filter share the same learning rate.
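One way to keep the trainable combination weights non-negative and summing to one is the softmax parameterization mentioned later in this section; the sketch below (class and variable names are ours) shows the linear combination with such weights.

```python
import torch

class CombinedLR(torch.nn.Module):
    """alpha* = gamma_1*alpha_1 + ... + gamma_n*alpha_n with trainable combination
    weights, parameterized through softmax so they stay non-negative and sum to one."""
    def __init__(self, n_levels=3):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(n_levels))

    def forward(self, *alphas):
        gamma = torch.softmax(self.logits, dim=0)
        return sum(g * a for g, a in zip(gamma, alphas))
```

Here the lowest-level input can be a tensor of parameter-wise rates for one layer while the other inputs are scalars, so the result broadcasts to one effective rate per parameter.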
As the real learning rates implemented in model parameter updating are a weighted combination, the corresponding Hessian matrices cannot be used directly for learning rate updating. If we took the gradient of the loss with respect to the combined learning rates and used it to update the learning rate of each parameter, the procedure would reduce to parameter-wise learning rate updating. To address this issue, we first break down the gradient of the combined learning rate into the three levels, use each part to update the learning rate at its level, and then recompute the combination from the updated learning rates. In particular, h_{p,t}, h_{l,t} and h_{g,t} are calculated from the gradients of the model loss without regularization, as shown in Eq. (34),
where h_t = Σ_l h_{l,t} = Σ_p h_{p,t}, h_{l,t} = Σ_{p ∈ l-th layer} h_{p,t}, and f(θ, α) corresponds to the model loss L_model(θ, α) in Section 3.2. Algorithm 1 gives the full updating rules for the newly proposed optimizer with three levels, which we denote combined adaptive multi-level hyper-gradient descent (CAM-HD). We use the general form of gradient-descent-based optimizers (Reddi et al., 2019; Luo et al., 2018): for SGD, φ_t(g_1, ..., g_t) = g_t and ψ_t(g_1, ..., g_t) = 1, while for Adam, φ_t and ψ_t are the bias-corrected exponential moving averages of the gradients and of the squared gradients, respectively. Notice that in each updating step of Algorithm 1, we re-normalize the combination weights γ_1, γ_2 and γ_3 so that their sum remains 1 even after updating with stochastic gradient-based methods. An alternative is a softmax parameterization, which requires an extra set of intermediate variables c_p, c_l and c_g with γ_p = softmax(c_p) = exp(c_p)/(exp(c_p) + exp(c_l) + exp(c_g)), etc. The updating of the γ's is then converted to the updating of the c's during training. In addition, the training of the γ's can also be extended to multi-level cases, which means we can have different combination weights in different layers. For the updating rates β_p, β_l and β_g of the learning rates at the different levels, we set them proportional to a shared parameter β and scale them according to the number of parameters contributing to h_{p,t}, h_{l,t} and h_{g,t}, which keeps the updating steps of the learning rates at different levels on the same scale. If we instead took averages over the number of parameters in Eq. (34) in the first place, this adjustment would not be required. CAM-HD is a higher-level adaptation approach that can be applied with any gradient-based updating rule and advanced adaptive optimizer. For example, it can be merged with Adabound by adding a parameter-wise clipping procedure (Luo et al., 2018):

η_t = Clip(α*_{p,t}/√V_t, η_l(t), η_u(t)),    (36)

where α* is the final step size given by the original CAM-HD, and η_l and η_u are the lower and upper bound functions in Adabound. η_t then replaces α*_{p,t}/√V_t in our algorithm, merging the two methods into the so-called "Adabound-CAM-HD". In the experiments, we follow the original paper and set η_l(t) = 0.1 − 0.1/((1 − β_2)t + 1) and η_u(t) = 0.1 + 0.1/((1 − β_2)t + 1) for both Adabound and Adabound-CAM-HD.

Algorithm 1: Updating rule of three-level CAM-HD (input: α_0, β, δ, T).
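The learning-rate part of one three-level update can be sketched as below; the scaling of the level-wise updating rates by group size (β_l = β/n_l, β_g = β/N) is our reading of how the steps are kept on a comparable scale, not a verbatim transcription of Algorithm 1.

```python
def cam_hd_lr_update(alpha_p, alpha_l, alpha_g, h_p, n_params, beta):
    """One three-level learning-rate update (sketch).

    alpha_p : dict {layer: tensor of parameter-wise rates}
    alpha_l : dict {layer: scalar layer-wise rate}
    alpha_g : scalar global rate
    h_p     : dict {layer: tensor of parameter-wise hyper-gradient signals}
    n_params: dict {layer: number of parameters in that layer}
    """
    h_l = {l: h.sum() for l, h in h_p.items()}       # layer-wise signal
    h_g = sum(h_l.values())                          # global signal
    n_total = sum(n_params.values())
    for l in h_p:
        alpha_p[l] = alpha_p[l] + beta * h_p[l]                    # beta_p = beta
        alpha_l[l] = alpha_l[l] + (beta / n_params[l]) * h_l[l]    # beta_l = beta / n_l
    alpha_g = alpha_g + (beta / n_total) * h_g                     # beta_g = beta / N
    return alpha_p, alpha_l, alpha_g
```

The combined rates α*_{p,t} are then obtained by weighting the three levels with the (re-normalized or softmax-parameterized) γ's before the model parameters are updated.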

Convergence analysis
The proposed CAM-HD is not a standalone optimization method; it can be applied on top of any gradient-based method, and its convergence properties depend largely on the base optimizer. Here we provide an analysis from the general perspective of learning rate adaptation (Baydin et al., 2017; Karimi et al., 2016). For global learning rate adaptation, if we assume that f is convex and L-Lipschitz smooth with ‖∇f(θ)‖ < M for some fixed M and all θ, the learning rate α_t satisfies

α_t ≤ α_0 + βM²t,

where α_0 is the initial value of α and β is the updating rate of hyper-gradient descent. By introducing κ_{p,t} = τ(t)α*_{p,t} + (1 − τ(t))α_∞, where the function τ(t) is selected to satisfy τ(t) → 0 as t → ∞, we have the following convergence theorem.
Theorem 2 (Convergence under certain assumptions about f). Suppose that f is convex and L-Lipschitz smooth with ‖∇f(θ)‖ < M for some fixed M and all θ. Then θ_t → θ* if α_∞ < 1/L and t·τ(t) → 0 as t → ∞, where the θ_t are generated according to (non-stochastic) gradient descent.
Since CAM-HD can be combined with any gradient-based updating rule, the same style of argument applies to the multi-level case. Following the discussion on convergence in (Baydin et al., 2017), we introduce κ_{p,t} = τ(t)α*_{p,t} + (1 − τ(t))α_∞, where the function τ(t) satisfies t·τ(t) → 0 as t → ∞ and α_∞ is a chosen constant. We then state the convergence result for the three-level case in the following theorem, where ∇_p is the gradient of the target function with respect to the model parameter with index p, ∇_l is the average gradient of the target function with respect to the parameters in the layer with index l, and ∇_g is the global average gradient of the target function with respect to all model parameters.
Theorem 3 (Convergence under mild assumptions about f). Suppose that f is convex and L-Lipschitz smooth with ‖∇_p f(θ)‖ < M_p, ‖∇_l f(θ)‖ < M_l, ‖∇_g f(θ)‖ < M_g for some fixed M_p, M_l, M_g and all θ. Then θ_t → θ* if α_∞ < 1/L, where L is the Lipschitz constant for all the gradients and t·τ(t) → 0 as t → ∞, and where the θ_t are generated according to (non-stochastic) gradient descent.
Proof. We take the three-level case discussed in Section 3 as an example, which includes the global, layer and parameter levels. Suppose that the target function f is convex and L-Lipschitz smooth at all levels, which gives, for all θ_1 and θ_2,

‖∇f(θ_1) − ∇f(θ_2)‖ ≤ L‖θ_1 − θ_2‖,

and that its gradients with respect to the parameter-wise, layer-wise and global parameter groups satisfy ‖∇_p f(θ)‖ < M_p, ‖∇_l f(θ)‖ < M_l, ‖∇_g f(θ)‖ < M_g for some fixed M_p, M_l, M_g and all θ. Then the effective combined learning rate for each parameter, α*_{p,t} = γ_p α_{p,t} + γ_l α_{l,t} + γ_g α_{g,t}, is bounded by the weighted sum of the corresponding level-wise bounds, where θ_{p,i} refers to the value of the parameter indexed by p at time step i, θ_{l,i} refers to the set of parameters in the layer with index l at time step i, and θ_{g,i} refers to the whole set of model parameters at time step i. In addition, n_p and n_l are the total number of parameters and the number of layers, and we use 0 < γ_p, γ_l, γ_g < 1. This gives an upper bound for the learning rate at each particular time step, which is O(t) as t → ∞. By introducing κ_{p,t} = τ(t)α*_{p,t} + (1 − τ(t))α_∞, where the function τ(t) is selected to satisfy t·τ(t) → 0 as t → ∞, we have κ_{p,t} → α_∞ as t → ∞. If α_∞ < 1/L, then for large enough t we have 1/(L + 1) < κ_{p,t} < 1/L, and the algorithm converges when the corresponding gradient-based optimizer converges with such a learning rate under our assumptions about f. This follows the discussion in (Karimi et al., 2016; Sun, 2019).
When we use κ_{p,t} instead of α*_{p,t} in Algorithm 1, the corresponding gradients ∂L(θ)/∂α*_{p,t−1} are replaced accordingly.

Theorem 4 (Convergence of Adabound-CAM-HD). Let {θ_t} and {V_t} be the sequences obtained from the modified Algorithm 1 for Adabound-CAM-HD discussed in Section 3.4, where the optimizer parameters in Adam satisfy β_1 = β_{11}, β_{1t} ≤ β_1 for all t ∈ [T] and β_1 < √β_2. Suppose f is a convex target function on Θ, and η_l(t) and η_u(t) are the lower and upper bound functions for all t ∈ [T] and θ ∈ Θ. Then for θ_t generated using the Adabound-CAM-HD algorithm, the regret is bounded as in (Luo et al., 2018). Due to the clipping procedure in Eq. (36), the η_t used for parameter updating satisfies η_l(t) ≤ η_t ≤ η_u(t). Hence, the proof of convergence of Adabound in (Luo et al., 2018) remains valid for Adabound-CAM-HD, ensuring that it achieves a high level of adaptiveness with good convergence properties. Notice that (Savarese, 2019) recommends additionally assuming t/η_l(t) − (t − 1)/η_u(t − 1) ≤ M for all t ∈ [T] as a correction. As the effective parameter-wise updating rates and the corresponding gradients may change after clipping, the updating rules for the other variables should be adjusted accordingly.
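For completeness, the clipping step referenced here can be sketched as follows, with the bound schedules from Section 3.4 (the function name is ours):

```python
import numpy as np

def adabound_clip(step, t, beta2=0.999):
    """Adabound-style clipping of the effective per-parameter step size
    (Luo et al., 2018); 'step' plays the role of alpha*_{p,t} / sqrt(V_t)."""
    eta_l = 0.1 - 0.1 / ((1.0 - beta2) * t + 1.0)   # lower bound grows towards 0.1
    eta_u = 0.1 + 0.1 / ((1.0 - beta2) * t + 1.0)   # upper bound shrinks towards 0.1
    return float(np.clip(step, eta_l, eta_u))
```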

Experiments
We use feed-forward neural network models and different types of convolutional neural networks on multiple benchmark datasets to compare with existing baseline optimizers. For each learning task, the following optimizers are applied: (a) standard baseline optimizers such as Adam and SGD; (b) hyper-gradient descent (Baydin et al., 2017); (c) L4 step-size adaptation for standard optimizers (Rolinek and Martius, 2018); (d) the Adabound optimizer (Luo et al., 2018); (e) the RAdam optimizer (Liu et al., 2019); and (f) the proposed adaptive combination of different levels of hyper-gradient descent. The implementation of (b) is based on the code provided with the original paper. One NVIDIA Tesla V100 GPU with 16 GB of memory and 61 GB of RAM, and two Intel Xeon 8-core CPUs with 32 GB of RAM, are used. The program is built in Python 3.5.1 and PyTorch 1.0 (Subramanian, 2018). For each experiment, we report both the average curves and standard error bars over ten runs.

Hyper-parameter Tuning
To compare the effect of CAM-HD with baseline optimizers, we first perform hyper-parameter tuning for each learning task by referring to related papers (Kingma and Ba, 2015; Baydin et al., 2017; Rolinek and Martius, 2018; Luo et al., 2018) as well as running an independent grid search (Bergstra et al., 2011; Feurer and Hutter, 2019). We mainly consider hyper-parameters including the batch size, the learning rate, and other optimizer parameters for models with different architectures. Other settings in our experiments follow open-source benchmark models. The search space for the batch size is {2^n}_{n=3,...,9}, while the search spaces for the learning rate, the hyper-gradient updating rate and the combination weight updating rate (CAM-HD-lr) are {10^{−1}, 10^{−2}, ..., 10^{−4}}, {10^{−1}, 10^{−2}, ..., 10^{−10}} and {0.1, 0.03, 0.01, 0.003, 0.001, 0.0003, 0.0001}, respectively. The selection criterion is the 5-fold cross-validation loss with early stopping at a patience of 3 (Prechelt, 1998). The optimized hyper-parameters for the tasks in this paper are given in Table 1. For training ResNets with SGDN, we apply a step-wise learning rate decay schedule as in (Luo et al., 2018; Liu et al., 2019). Notice that although the hyper-parameters are tuned, this does not mean that the model performance is sensitive to each of them.
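For reference, the search grid described above can be written down directly (values copied from the text; the dictionary keys are our labels):

```python
search_space = {
    "batch_size":     [2 ** n for n in range(3, 10)],      # 8, 16, ..., 512
    "learning_rate":  [10.0 ** -k for k in range(1, 5)],   # 1e-1 ... 1e-4
    "hypergrad_rate": [10.0 ** -k for k in range(1, 11)],  # 1e-1 ... 1e-10
    "cam_hd_lr":      [0.1, 0.03, 0.01, 0.003, 0.001, 0.0003, 0.0001],
}
```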

Combination Ratio and Model Performances
First, we study the initialization of the combination weights of the different levels of learning rates in the CAM-HD framework. The simulations are based on image classification tasks on MNIST and CIFAR10 (LeCun et al., 1998; Krizhevsky and Hinton, 2012). We use the full training sets of MNIST and CIFAR10 for training and the full test sets for validation. One feed-forward neural network with three hidden layers of size [100, 100, 100] and two convolutional network models, LeNet-5 (LeCun et al., 2015) and ResNet-18 (He et al., 2016), are implemented. In each case, two levels of learning rates are considered: global and layer-wise adaptation for the FFNN, and global and filter-wise adaptation for the CNNs. For LeNet-5 and the FFNN, Adam-CAM-HD with fixed and trainable combination weights is implemented, while for ResNet-18, both Adam-CAM-HD and SGDN-CAM-HD with fixed and trainable combination weights are implemented in two independent simulations. We vary the initialized combination weights of the two levels and observe the following findings (see Figure 2): First, the optimal performance is usually attained neither at the fully global level nor at the fully layer/filter level, but at a weighted combination of the two levels of adaptive learning rates, for both the update and no-update cases. Second, CAM-HD methods outperform the baseline Adam/SGDN methods for most initializations of the combination ratio. Third, updating the combination weights is effective and helps to achieve better performance than fixed combination weights. This supports our analysis in Section 3.3. Also, in real training processes, it is possible that learning favors different combination weights at various stages, which requires online adaptation of the combination weights.

Feed Forward Neural Network for Image Classification
This experiment is conducted with feed-forward neural networks for image classification on MNIST, which includes 60,000 training examples and 10,000 test examples. We use the full training set for training and the full test set for validation. Three FFNNs with different hidden layer configurations are implemented (Svozil et al., 1997; Fine, 2006), namely [100, 100], [1000, 100], and [1000, 1000]. Adaptive optimizers including Adam, Adabound, Adam-HD with two hyper-gradient updating rates, and the proposed Adam-CAM-HD are applied. For Adam-CAM-HD, we apply three-level parameter-layer-global adaptation with initialization γ_1 = γ_2 = 0.3 and γ_3 = 0.4, and two-level layer-global adaptation with γ_1 = γ_2 = 0.5. No learning rate decay function is applied. Figure 3 shows the validation accuracy curves of the different optimizers during 30 epochs of training. Both the two-level and three-level Adam-CAM-HD significantly outperform the baseline Adam optimizer with optimized hyper-parameters. For Adam-HD, we find that the default hyper-gradient updating rate for Adam (β = 10^{−7}) applied in (Baydin et al., 2017) is not optimal in our experiments; an optimized rate of 10^{−9} can outperform Adam but is still worse than Adam-CAM-HD with β = 10^{−7}.
The test accuracy of each setting and the corresponding standard error of the sample mean in 10 trials are given in Table 2.

Lenet-5 for Image Classification
The second experiment uses LeNet-5, a classical convolutional neural network that does not involve many building and training tricks (LeCun et al., 2015). We compare a set of adaptive optimizers, including Adam, Adam-HD, Adam-CAM-HD, Adabound, RAdam and L4, on the image classification tasks of MNIST, CIFAR10 and SVHN (Netzer et al., 2011). For Adam-CAM-HD, we apply a two-level setting with filter-wise and global learning rate adaptation and initialize γ_1 = 0.2, γ_2 = 0.8. We also implement an exponential decay function τ(t) = exp(−rt), as discussed in Section 3.5, with rate r = 0.002 for all three datasets, where t is the number of iterations. For L4, we use the recommended L4 learning rate of 0.15. For Adabound and RAdam, we also apply the hyper-parameters recommended in the original papers. The other hyper-parameter settings are those optimized in Section 4.1. As shown in Figure 4, Adam-CAM-HD again shows an advantage over the other methods in all three sub-experiments, except on MNIST, where L4 can perform better at a later stage. The experiment on SVHN indicates that the recommended hyper-parameters for L4 can fail in some cases, yielding unstable accuracy curves. RAdam and Adabound outperform the baseline Adam method on MNIST, while Adam-HD does not show a significant advantage over Adam with the optimized hyper-gradient updating rate shared with Adam-CAM-HD. The corresponding summary of test performance is given in Table 3, in which Adam-CAM-HD outperforms the other optimizers in test accuracy on both CIFAR10 and SVHN. In particular, it gives significantly better results than Adam and Adam-HD on all three datasets.

ResNet for Image Classification
In the third experiment, we apply ResNets to the image classification task on CIFAR10 (He et al., 2016; DeVries and Taylor, 2017), following the code provided at github.com/kuangliu/pytorch-cifar. We compare Adam and Adam-based adaptive optimizers, as well as SGD with Nesterov momentum (SGDN) and the corresponding adaptive optimizers, for training both ResNet-18 and ResNet-34. For the SGDN methods, we apply a learning rate schedule in which the learning rate is initialized to the default value of 0.1 and reduced to 0.01, or to 10% of its current value for SGDN-CAM-HD, after epoch 150. The momentum is set to 0.9 for all SGDN methods. For Adam-CAM-HD and SGDN-CAM-HD, we apply two-level CAM-HD with the same setting as in the second experiment. We also implement the Adabound-CAM-HD discussed in Section 3.4 by sharing the common parameters with Adabound. In addition, we apply an exponential decay function with decay rate r = 0.001 for all CAM-HD methods. The learning curves for validation accuracy, training loss, and validation loss of ResNet-18 and ResNet-34 are shown in Figure 5. We can see that the validation accuracy of Adam-CAM-HD reaches about 90% within 40 epochs and consistently outperforms the Adam, L4 and Adam-HD optimizers at later stages. The L4 optimizer with the recommended hyper-parameter and an optimized weight-decay rate of 0.0005 (instead of the 1e-4 applied in the other Adam-based optimizers) can outperform baseline Adam for both ResNet-18 and ResNet-34, and its training loss is lower than that of all other methods, with potential overfitting. Adam-HD achieves similar or better validation accuracy than Adam with an optimized hyper-gradient updating rate of 10^{−9}. RAdam performs slightly better than Adam-CAM-HD in terms of validation accuracy, but the validation cross-entropy of both RAdam and Adabound is outperformed by our method. Also, we find that in training ResNet-18/34, the validation accuracy and validation loss of SGDN-CAM-HD slightly outperform SGDN in most epochs, even after the resetting of the learning rate at epoch 150. The test performances (average accuracy and standard error) of the different optimizers for ResNet-18 and ResNet-34 after 200 epochs of training are shown in Table 4. For both ResNet-18 and ResNet-34, the proposed CAM-HD methods (Adam-CAM-HD, Adabound-CAM-HD and SGDN-CAM-HD) improve upon the corresponding baseline methods (Adam, Adabound and SGDN) with statistical significance. In particular, Adabound-CAM-HD outperforms both Adam-CAM-HD and Adabound.

Discussion
The experiments on both small and large models demonstrate the advantage of the proposed method over baseline optimizers in terms of validation and test accuracy. One explanation for the performance improvement is that the method achieves a higher level of adaptation by introducing hierarchical learning rate structures with learnable combination weights, while the over-parameterization of adaptive learning rates is controlled by the intrinsic regularization effect. In addition, the experiments show that the performance improvement does not require tuning the hyper-parameters independently if the task or model is similar. For example, the hyper-gradient updating rates for LeNet-5, ResNet-18 and ResNet-34 are all set to 1e-8 in our experiments, regardless of the dataset being learned. Also, the hyper-parameter CAM-HD-lr is shared within each group of models (FFNNs, LeNet-5, ResNets) for all datasets. For the combination ratio, γ_1 = 0.2, γ_2 = 0.8 works for all our experiments with convolutional networks. However, as the loss surface with respect to the combination weights may not be convex for deep learning models, the learning of combination weights may fall into local optima. Therefore, several trials may be needed to find a good initialization of the combination weights, although the learning of combination weights works locally (Feurer and Hutter, 2019). In general, the selected hyper-parameters are transferable to a similar task and still yield an improvement over the corresponding baseline, while the optimal hyper-parameter setting may shift slightly. The proposed CAM-HD method can also be combined with learning rate schedules in many ways to achieve further improvement. One example is our ResNet experiment on CIFAR10 with SGDN and SGDN-CAM-HD. For more advanced learning rate schedules (Lang et al., 2019; Ge et al., 2019), we can apply strategies such as a piece-wise adaptive scheme by re-initializing all levels at different steps. Another option is to replace the global-level learning rate with the scheduled learning rate while adapting the combination weights and the other levels continuously.

Learning of combination weights
The following figures, Figure 6, Figure 7, Figure 8 and Figure 10, give the learning curves of the combination weights with respect to the number of training iterations in each experiment, in which each curve is averaged over 5 trials and shown with error bars. Through these figures, we can compare the updating curves for different models, different datasets and different CAM-HD optimizers.
Figure 6 corresponds to the experiment of the FFNN on MNIST in Section 3.3 of the main paper, which is a three-level case. We can see that for different FFNN architectures, the learning behaviors of the γ's show different patterns, even though they are trained on the same dataset. Meanwhile, the standard errors over multiple trials are much smaller than the changes in the average combination weight values. Figure 7 compares different optimizers on the same task: while the initialization is the same for the three optimizers, the values of γ_1 and γ_2 only change by a small proportion when training with Adam-CAM-HD, whereas the change towards larger filter/layer-wise adaptation is much more significant when SGD-CAM-HD or SGDN-CAM-HD is implemented. The numerical results show that for SGDN-CAM-HD, the average weight for layer-wise adaptation γ_1 jumps from 0.2 to 0.336 in the first epoch, then drops back to 0.324 before increasing steadily to about 0.388. For Adam-CAM-HD, the average γ_1 moves from 0.20 to 0.211, a change of about 5%. In Figure 8, both subplots show LeNet-5 models trained with Adam-CAM-HD, where the exponential decay rate for the weighted approximation is set to τ = 0.002. For the updating curves in Figure 8(a), trained on CIFAR10 with Adam-CAM-HD, the combination weight for filter-wise adaptation moves from 0.20 to 0.188. Meanwhile, for the updating curves in Figure 8(b), trained on SVHN, the combination weight for filter-wise adaptation moves from 0.20 to 0.195. Further exploration shows that τ has an impact on the learning curves of the combination weights. As shown in Figure 9, a smaller τ = 0.001 results in a more significant change of the combination weights during training with Adam-CAM-HD. A similar effect can also be observed in the learning curves of the γ's for ResNet-18, given in Figure 10, where we only show the first 8,000 iterations. Again, we find that in training ResNet-18 on CIFAR10, the combination weights of SGD/SGDN-CAM-HD change much faster than those of Adam-CAM-HD. There are several reasons for this effect. First, in the cases where the γ's do not move significantly, we apply Adam-CAM-HD, whose main learning rate (1e-3) is only about 1%-6% of the learning rate of SGD or SGDN (1e-1).
In Algorithm 1, we can see that the updating rate of the γ's is proportional to α when the other terms remain unchanged. Thus, for the same task, if the same value of the updating rate δ is applied, the updating scale of the γ's for Adam-CAM-HD can be much smaller than that for SGD/SGDN-CAM-HD. In terms of space complexity, due to the implementation of the chain rule, CAM-HD takes at most twice the space for storing intermediate variables in the worst case. For two-level learning rate adaptation considering global and layer-wise learning rates, the extra space complexity of CAM-HD is one to two orders of magnitude smaller than that of the baseline model during training.

Time Complexity
In CAM-HD, we need to calculate the gradient of the loss with respect to the learning rates at each level, which are h_{p,t}, h_{l,t} and h_{g,t} in the three-level case. However, since the gradient of each parameter is already known during normal model training, the extra computational cost comes from taking summations and updating the lowest-level learning rates. In general, this cost is linear in the number of differentiable parameters of the original model. Here we discuss the cases of feed-forward networks and convolutional networks.
Recall that for a feed-forward neural network the overall computational complexity of training is

O(m · n_iter · Σ_{l=1}^{n_layer} n_{l−1} · n_l),

where m is the number of training examples, n_iter is the number of training iterations, and n_l is the number of units in the l-th layer. On the other hand, when using three-level CAM-HD with a parameter-wise lowest level, we need n_layer element-wise products to calculate h_{p,t} for all layers, n_layer element-wise matrix summations to calculate h_{l,t} for all layers, and one list summation to calculate h_{g,t}. In addition, two element-wise summations are needed to calculate α_{p,t} and α*_p. Therefore, the extra computational cost of using CAM-HD is

∆T = O(m_b · n_iter · Σ_{l=1}^{n_layer} n_{l−1} · n_l),

where m_b is the number of mini-batches for training. Notice that m/m_b is the batch size, which is usually larger than 100, so this extra cost is more than one order of magnitude smaller than the computational complexity of training the model without learning rate adaptation. For the cases where the lowest level is layer-wise, only one element-wise matrix product is needed in each layer to calculate h_{l,t}. For convolutional neural networks, the total time complexity of all convolutional layers is (He and Sun, 2015)

O(m · n_iter · Σ_{l=1}^{n_conv_layer} n_{l−1} · s_l² · n_l · m_l²),

where l is the index of a convolutional layer and n_conv_layer is the depth (number of convolutional layers); n_l is the number of filters in the l-th layer, n_{l−1} is the number of input channels of the l-th layer, s_l is the spatial size of the filter, and m_l is the spatial size of the output feature map. If we consider convolutional filters as layers, the extra computational cost of CAM-HD in this case is ∆T(n) = O(m_b · n_iter · Σ_{l=1}^{n_conv_layer} ((n_{l−1} · s_l² + 1) · n_l)), which is still more than one order of magnitude smaller than the cost of the model without learning rate adaptation.
Therefore, for large networks, applying CAM-HD does not significantly increase the computational cost from a theoretical perspective.

Conclusion
In this study, we propose a gradient-based learning rate adaptation strategy that introduces hierarchical learning rate structures in deep neural networks. By considering the relationship between regularization and the combination of adaptive learning rates at multiple levels, we further propose a joint algorithm for adaptively learning the combination weight of each level. It increases the adaptiveness of the hyper-gradient descent method at any single level, while the over-parameterization involved in the optimizer is controlled by an adaptive regularization effect. Experiments on FFNNs, LeNet-5 and ResNet-18/34 indicate that the proposed methods can outperform standard Adam/SGDN and other baseline methods with statistical significance.

Figure 1: The diagram of a three-level learning rate combination

Figure 2: The diagram of model performances trained by Adam/SGDN-CAM-HD with different combination ratios in the case of two-level learning rate adaptation. The x-axis is the ratio of global-level adaptive learning rates. ResNet-18s are trained for 10 epochs only.

Figure 3: The comparison of learning curves of FFNN on MNIST with different adaptive optimizers.

Figure 4: The comparison of learning curves of training LeNet-5 with different adaptive optimizers.

Table 1: Hyperparameter Settings for Experiments

Table 2: Summary of test performances with FFNNs.

Table 3: Summary of test performances with LeNet-5.