AdaLip: An Adaptive Learning Rate Method per Layer for Stochastic Optimization

Various works have been published on the optimization of Neural Networks that emphasize the significance of the learning rate. In this study we analyze the need for a different treatment of each layer and how this affects training. We propose a novel optimization technique, called AdaLip, that utilizes an estimation of the Lipschitz constant of the gradients in order to construct an adaptive learning rate per layer that can work on top of already existing optimizers, like SGD or Adam. A detailed experimental framework was used to demonstrate the usefulness of the optimizer on three benchmark datasets. The experiments showed that AdaLip not only improves training performance and convergence speed, but also makes the training process more robust to the selection of the initial global learning rate.


Introduction
Neural Networks produce state-of-the-art results in various research fields, such as image recognition [1,2], speech recognition [3,4], machine translation [5], autonomous driving [6], text generation [7] and many others. As more and more data become available, the need for deep neural networks becomes more evident. Deep networks can be trained through many learning algorithms, most of them being variations of Stochastic Gradient Descent (SGD).
The training task of a neural network can be represented as the optimization problem of finding the best parameters w* that minimize the loss function f : ℝ^d → ℝ, given a set of training samples x:

w* = arg min_w f(w; x)    (1)

The update rule of SGD can be summarized as an iterative movement in the opposite direction of the gradient:

w_{t+1} = w_t − α_t ∇f(w_t)    (2)

where α_t is the learning rate. For simplicity, we will often refer to the gradient as g_t and the update as u_t. The convergence and performance of SGD are greatly influenced by the learning rate, which is why it is one of the most important hyperparameters to fine-tune in a neural network. Selecting a value of α_t larger than the optimal one can lead to unpredictable oscillations of the learning curve and even to divergence. On the other hand, small values reduce the convergence speed and make the loss more prone to being trapped in a local minimum [8,9].
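To make the update rule concrete, here is a minimal sketch (ours, not from the paper) of the SGD step of Eq. 2 applied to a toy quadratic loss f(w) = ½‖w‖², whose gradient is simply w:

```python
import numpy as np

def sgd_step(w, grad, lr):
    # Eq. 2: move opposite to the gradient, scaled by the learning rate.
    return w - lr * grad

w = np.array([4.0, -2.0])
for t in range(100):
    g = w                      # gradient of f(w) = 0.5 * ||w||^2
    w = sgd_step(w, g, lr=0.1)
# Each step contracts w by a factor (1 - lr), so w approaches the minimum at 0.
```

Note that with a learning rate larger than 2 the iterates of this toy problem would diverge instead, illustrating the sensitivity to α_t discussed above.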
The most common practice is to decrease the learning rate during training [10,11]. However, there are many indications that this is not the best scheduling strategy [12]. Numerous works have been published on the optimal learning rate; the most successful revolve around techniques that change it adaptively depending on various conditions and metrics. In this study a new method is presented that changes the learning rate of each layer based on the Lipschitz constant.

Related Work
As mentioned previously, the learning rate is one of the most important hyperparameters in Gradient Descent. Consequently, there have been numerous studies aiming at identifying the optimal learning rate. The most common approaches focus on defining a strategy to change the learning rate during training, usually referred to as a schedule. The earliest instance of such a schedule is the Robbins/Monro theory [10], which states that the learning rate should be chosen to satisfy the following conditions:

Σ_{t=1}^∞ α_t = ∞ and Σ_{t=1}^∞ α_t² < ∞

Another scheduling scheme calls for starting at a relatively high learning rate, to achieve fast convergence, and then halving it every few iterations, to ensure small proximity to w* [13]. Due to its simplicity, variants of this scheme have been employed to train some of the most popular architectures [14]. Another strategy that has proven effective in practice is to start training with a constant learning rate and decrease it by a factor of 2-10 once the loss stops decreasing [15].
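As an illustration only (the helper names are ours, not from the cited works), the halving schedule and a schedule satisfying both Robbins/Monro conditions, α_t = α_0 / t, can be sketched as:

```python
def halving_schedule(lr0, t, every=10):
    # Halve the learning rate every `every` iterations, as in [13].
    return lr0 * 0.5 ** (t // every)

def robbins_monro_schedule(lr0, t):
    # alpha_t = alpha_0 / t: the sum of alpha_t diverges while the sum of
    # alpha_t^2 is finite, so both Robbins/Monro conditions hold.
    return lr0 / t
```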
A strong assumption usually made by optimization algorithms is that the loss function f is convex. In this setting, a good selection of the learning rate is shown to be:

α_t = α_0 / √t    (3)

which guarantees a convergence of E[f(w_t) − f*] ≤ O(log(t)/√t) without any smoothness assumptions, if the base learning rate α_0 is chosen properly [16]. [17] proves, within a slightly different framework, that the learning rate of Eq. 3 achieves a convergence of O(1/√t).
A strategy gaining significant popularity lately is scheduling the learning rate in a periodical fashion. More specifically, cyclical learning rate schedules (i.e. both increasing and decreasing the learning rate during training) have proven to be very effective in practice [8,9,12]. This helps the network escape bad local minima in which the training process could get stuck.
One of the main disadvantages of using SGD as an optimizer is that it scales the update with the magnitude of the gradient in all directions. Sometimes this may lead to slower convergence rates and poor performance. Ideally, the optimization procedure would benefit from being able to choose different learning rates for different weights, or sets of weights, of the network. For example, it could be useful to set higher learning rates for small gradients (or the opposite) when a better point in the loss space needs to be reached [18].
To overcome this issue, a lot of methods have been proposed that offer "adaptive" learning rates. The first such algorithm was AdaGrad [19], which adapts the learning rate with the accumulated squared gradient for each parameter individually. This has been shown to improve the convergence rate of SGD in non-convex settings, especially on highly sparse data, as it decreases the learning rate faster for frequent parameters, and slower for infrequent parameters. One major drawback, however, is that the adaptive learning rate tends to get small over time, due to the accumulation of gradients from the beginning of training.
RMSProp [20] is another optimizer that aims to fix the aforementioned shortcomings of AdaGrad. Instead of accumulating all the squared gradients, it uses an exponential moving average. This is helpful because the moving average of the squared gradients does not get extremely large, which would otherwise force the overall learning rate towards zero. Another major improvement in optimization techniques came in the form of Adam [21], which, in addition to scaling the learning rate with the moving average of the squared gradients, also averages the gradients themselves. This optimizer has grown substantially in popularity and is currently the "default choice" in most deep learning frameworks [15,22].
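For reference, a minimal sketch of one Adam step following the published update rule (the function and variable names are ours):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Moving averages of the gradients and of their squares [21].
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy usage on f(w) = 0.5 * ||w||^2, whose gradient is w:
w = np.array([1.0, -1.0]); m = np.zeros(2); v = np.zeros(2)
for t in range(1, 201):
    w, m, v = adam_step(w, w, m, v, t, lr=0.05)
```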
This family of adaptive optimizers, however, is far from perfect and has received a lot of criticism, as such methods lead to biased gradient updates which change the underlying optimization problem [18]. In fact, it has been shown that in some cases adaptive methods find drastically different solutions compared to SGD [22]. Other techniques have been proposed to alleviate the aforementioned issues of adaptive learning rates, namely AMSGrad [23] and AdaBound [24], which provide strong theoretical proofs of convergence. [25] proposes another variation of adaptive learning rates, theoretically deriving the Lipschitz constant for neural networks with different types of loss functions. Another attempt to compute the Lipschitz constant was made by [26]. However, none of these approaches has yet surpassed the popularity of Adam.
Another approach for adapting the learning rate during training is to update it at each iteration (using backpropagation) in order to maximize some criterion. This is equivalent to "learning" the learning rate for some external goal, e.g. the minimization of the cost function [27], or of the squared norm of its derivative [18]. The first technique is interesting; however, it falls into the same pitfall mentioned previously, of using a scalar learning rate.
In this paper a new adaptive method is proposed that computes the learning rate based on the Lipschitz constant. The main contributions of the present paper are the following:
• A novel adaptive optimization technique that approximates the Lipschitz constant of the gradient of the loss function in order to estimate the optimal learning rate.
• An empirical analysis indicating a heterogeneity in the magnitude of the gradients of different layers, which led us to the use of a different learning rate per layer.
• An experimental study providing insights on neural network training as well as some recommended practices.
3 Theoretical Analysis

Preliminaries
Before describing our contributions, we present a generic framework which can represent all adaptive optimization methods. In the sequel, we adopt standard notation and relevant mathematical techniques from the literature [21,23]. We use F ⊆ ℝ^d to denote the set of feasible points for the weights w_t. We assume that the set F has bounded diameter D_∞, i.e. ‖x − y‖ ≤ D_∞ for all x, y ∈ F. We further assume that the gradients are bounded on F, i.e. ‖g_t‖ ≤ G. This generic framework is portrayed in Algorithm 1.

Algorithm 1 Generic framework of adaptive optimization methods
Input: w_1 ∈ F, initial step α_0, functions {φ_t, ψ_t}_{t=1}^T
1: for t = 1 to T do
2:   g_t = ∇f_t(w_t)
3:   m_t = φ_t(g_1, ..., g_t) and V_t = ψ_t(g_1, ..., g_t)
4:   w_{t+1} = w_t − α_t m_t / √V_t
5: end for

Table 1 summarizes how various algorithms fit into the generic framework. The main differences among these algorithms lie in the selection of the "averaging" functions φ and ψ, through which the parameters m_t and V_t are updated. Through this abstraction, the differences among the various optimizers become more apparent.
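The generic framework can be sketched in code as follows (our reading of Algorithm 1; the names are ours): an optimizer is fully specified by its averaging functions φ and ψ.

```python
import numpy as np

def generic_step(w, grads, lr, phi, psi, eps=1e-8):
    # grads: list of all past gradients g_1..g_t (most recent last).
    m = phi(grads)                      # m_t = phi_t(g_1, ..., g_t)
    V = psi(grads)                      # V_t = psi_t(g_1, ..., g_t)
    return w - lr * m / (np.sqrt(V) + eps)

# SGD: phi keeps only the last gradient, psi is the constant 1.
sgd_phi = lambda gs: gs[-1]
sgd_psi = lambda gs: np.ones_like(gs[-1])

# AdaGrad: psi accumulates all squared gradients.
adagrad_psi = lambda gs: sum(g * g for g in gs)
```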

Background
Definition 1 Let f : ℝ^d → ℝ be a function with a smooth gradient, i.e.

‖∇f(w_1) − ∇f(w_2)‖ ≤ L ‖w_1 − w_2‖, for all w_1, w_2    (4)

The smallest L that satisfies the above inequality is called the Lipschitz constant. All the norms discussed in this study are Euclidean, ‖·‖ = ‖·‖_2.

Consider an optimization problem where the goal is to find the set of parameters w that minimize the loss function f (i.e. Eq. 1) through Gradient Descent (Eq. 2).
Lemma 1 Given a convex loss function f with an L-Lipschitz continuous gradient (Def. 1), the optimal learning rate for Gradient Descent is:

α* = 1/L

The proof can be found in Appendix A. Computing the optimal learning rate, however, requires prior knowledge of the Lipschitz constant of the loss function's gradient, which is generally not available.
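A quick numeric illustration of Lemma 1 (a toy example of ours, not from the paper): for the quadratic f(w) = (L/2)‖w‖², the gradient Lw has Lipschitz constant exactly L, and one Gradient Descent step with α = 1/L lands exactly on the minimizer.

```python
import numpy as np

L = 5.0
w = np.array([3.0, -1.0])
grad = L * w                      # gradient of (L/2) * ||w||^2
w_next = w - (1.0 / L) * grad     # one GD step with the optimal rate 1/L
# w_next is exactly the minimizer at the origin.
```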
In practice, it is difficult to find an overall good learning rate, because it can change from dataset to dataset and from model to model. Moreover, because the majority of deep learning models are trained with variants of SGD, meaning noisy gradients, calculating the exact value of L is not easily feasible. If, however, the Lipschitz constant could be learned during training and its approximate value were accurate, then one could adapt the learning rate towards its optimal value as training progresses. Up till now this has remained an open research question [18].

Learning the Lipschitz Constant
Considering that the update steps are quite small at each iteration, we can calculate the constant L of a small subspace, the one that the optimizer explores, assuming that it is L-smooth. This means that the required information about the subspace of the loss function is computed during a single forward pass. A multiple-forward-pass scheme could be devised by computing the gradients in different nearby directions, but the extra training time makes it infeasible for large-scale datasets and deep models. Below, we present the approximation of L in a stochastic environment. It can be derived from Eq. 4, if (w_1, w_2) is substituted by (w_t, w_{t−1}):

L ≈ ‖∇f(w_t) − ∇f(w_{t−1})‖ / ‖w_t − w_{t−1}‖    (5)

The analysis so far has been based on Gradient Descent. In a realistic scenario its stochastic version takes its place. This translates to the gradients being calculated from a subset of the dataset (i.e. a batch) and, as a result, from a subspace of the loss space. These gradients, g_t, are noisy versions of the total gradient, but their expected value is equal to the total: E[g_t] = ∇f(w_t). By renaming ∇f(w_t) as g_t and using Eq. 5 together with α* = 1/L, the above transforms to:

α* ≈ ‖w_t − w_{t−1}‖ / ‖g_t − g_{t−1}‖ = α_t ‖g_{t−1}‖ / ‖g_t − g_{t−1}‖    (6)

where α* is the optimal learning rate and α_t is the initial learning rate. Equation 6 offers a way of approximating the optimal value of the learning rate α*. Ideally, we would want to set the learning rate at each iteration to the value indicated. This practice, however, is not recommended, as the estimate exhibits a high degree of variance from iteration to iteration, which would result in very unstable training.
One issue that arises is due to the denominator, the norm of the difference between the gradients of two iterations, which can be seen as the magnitude of change in the direction the optimizer chooses. The problem occurs especially near local minima, where the updates are smaller and close to each other. In this situation the denominator might reach values close to zero, causing the learning rate to explode. A workaround is to add a small positive term, c_t, to the denominator:

α* ≈ α_t ‖g_{t−1}‖ / (‖g_t − g_{t−1}‖ + c_t)    (7)

This creates an artificial lower bound on the denominator, which can change over the course of training. In the following sections, the impact of the hyperparameter c_t on the training process is explored in more detail.
The instability of the algorithm, however, is not solely attributed to the denominator; the stochastic nature of the algorithm also plays its part. Equation 7 helps with approximating the optimal learning rate of the loss subspace visible through each batch x_t. Our goal, though, is to approximate the optimal learning rate of the underlying loss function. To achieve this, the moving average of α* over the batches is computed:

S_t = γ S_{t−1} + (1 − γ) ‖g_{t−1}‖ / (‖g_t − g_{t−1}‖ + c_t)    (8)

with γ ∈ (0, 1) being the coefficient of the moving average. In closed form this can be written as:

S_t = γ^t S_0 + (1 − γ) Σ_{i=1}^t γ^{t−i} ‖g_{i−1}‖ / (‖g_i − g_{i−1}‖ + c_i)

with S_0 = 1. Thus, the approximation A_t of the optimal learning rate α* is calculated as the product of α_t and the above exponential moving average:

A_t = α_t S_t    (9)
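The chain from the per-batch estimate (Eq. 7) through the moving average (Eq. 8) to the adapted rate (Eq. 9) can be sketched as follows (function and variable names are ours):

```python
import numpy as np

def adalip_lr(alpha_t, g_prev, g_curr, S_prev, gamma=0.8, c=1e-8):
    # Eq. 7: per-batch estimate of the optimal learning-rate multiplier.
    ratio = np.linalg.norm(g_prev) / (np.linalg.norm(g_curr - g_prev) + c)
    # Eq. 8: exponential moving average to tame the batch-to-batch variance.
    S = gamma * S_prev + (1 - gamma) * ratio
    # Eq. 9: the adapted learning rate A_t.
    return alpha_t * S, S

A, S = adalip_lr(0.1, np.array([1.0, 0.0]), np.array([0.0, 1.0]), S_prev=1.0)
```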

Lemma 2 Consider a loss function f that satisfies Definition 1, with bounded gradients ‖g_t‖ ≤ G. Let M be the minimum norm of the gradients that is non-zero; then the moving average S_t of Eq. 8 is bounded.
The above lemma proves that the moving average of Eq. 8 is bounded and, as a result, the learning rate A_t is also bounded. This is helpful for the next theorem, which proves the convergence of an SGD optimizer with such a learning rate. For the proof of convergence, the notion of regret is used. Regret (R) is the sum of all previous differences between the current network's prediction f_t(w_t) and the best possible prediction f_t(w*), obtained from the optimal set of weights w*. The goal of the proof is to show that the regret averaged over time reaches zero. The regret can be written in the following form:

R(T) = Σ_{t=1}^T [ f_t(w_t) − f_t(w*) ]

Theorem 1 Assume that the loss function has bounded gradients, ‖g_t‖ ≤ G, minimum non-zero gradient norm M, and that the weights lie in the sphere ‖w_t‖ ≤ r. Let α_t = α_0/√t and γ ∈ (0, 1); then an SGD optimizer with Eq. 9 as its learning rate guarantees, for all T > 1 and a suitably lower-bounded c_t, that the average regret R(T)/T vanishes. The proofs of both Lemma 2 and Theorem 1 can be found in Appendix A.

Motivation
Usually in neural networks, parameters within the same layer share similar properties. From layer to layer, though, they might differ. An occasion where this is important is the case of vanishing gradients [28], where the early layers of the network won't be trained adequately.
While this problem has been addressed with non-saturating activation functions [29], better weight-initialization strategies [30] and transformation layers [28], in practice none of the above solves the issue directly.
In order to gain better insight into the training process and into how layers differ from each other, an experimental scheme was set up to observe the magnitude of the gradients and the magnitude of the updates on 3 different datasets, comparing two optimizers, SGD and Adam, with steady learning rates. To measure this, we computed the mean of the absolute values of the gradients and of the updates of each layer of the network at every iteration. The results are shown in Fig. 1; the networks that were used are shown in Appendix B. The blue lines correspond to the earlier iterations of training and, as the colormap approaches red, to the last iterations. Only Convolutional and Dense layers are displayed. The two left columns display the networks that were trained with SGD. As expected, the magnitude of the updates is proportional to the magnitude of the gradients and their shapes are quite similar. On the other hand, the networks trained with the Adam optimizer showed faster convergence and achieved higher accuracy. Analysing the two right columns of Fig. 1, we can see that the magnitude of the updates is neither proportional to the magnitude of the gradients nor does it mimic their shape. This is an insight into what makes Adam better than SGD in this specific scenario.

Fig. 1 The magnitude of the gradients and the updates per layer for all datasets (MNIST, CIFAR10, CIFAR100) for two different optimizers (SGD and Adam). The blue lines depict the first iterations; as the color changes to red, it shows iterations closer to the end.

Adam is able to construct updates closer to the optimal ones independently of the magnitude of the gradient. SGD usually suffers in this respect: some layers in the middle might display low gradients, hindering the path to the optimal updates. This shows the usefulness of an adaptive learning rate per layer that will help any optimizer, like SGD or Adam, scale the gradients and, in consequence, the updates, in a way that makes it easier to reach the optimal weights.
All the weights and gradients mentioned in the previous equations of Sect. 3 were the full vectors containing every parameter of the network. In a realistic setting, however, it is difficult to construct one large vector of this size (usually millions of dimensions) and perform calculations on it, such as multiplications, computing norms, etc. This is the second reason why it is preferable to compute the updates of the network for every layer separately. This led to the motivation of constructing an optimizer that employs a different adaptive learning rate per layer. In the next section, the full algorithm is displayed, along with how it fits with other optimizers.

Proposed Optimization Framework
This paper proposes an optimization method, called AdaLip, which adapts the learning rate per layer. This is helpful because it can alleviate issues that occur from underfitted layers, while being able to work with any scheduler that monitors the global learning rate α_t. Additionally, it can efficiently work on top of existing optimizers that use an adaptive learning rate per parameter, e.g. Adam. This study focuses on the combination of AdaLip with three of the most popular optimizers in deep learning: SGD, Adam and RMSProp. A second variation of AdaLip, with a slight modification in its update rule, is introduced in Appendix C.

AdaLip
Algorithms 2, 3 and 4 show the three new optimizers that are created. The first is the vanilla version of AdaLip, where the algorithm adapts the learning rates of an SGD optimizer. As discussed previously, the algorithm estimates the Lipschitz constant for each batch and, through a moving average, approximates the optimal learning rate A_t (Eq. 9). This is done for each layer independently, for the reasons mentioned in Sect. 3.4. It is important to note that the norms displayed in Algorithms 2, 3 and 4 are computed using only the gradients of the weights that are in the same layer as the weight being updated in that iteration.

Algorithm 2 AdaLip
Input: w_1 ∈ F, initial step α_0, γ ∈ (0, 1), c_t
Initialize: S_0 = 1, g_0 = 0
1: for t = 1 to T do
2:   g_t = ∇f_t(w_t)
3:   S_t = γ S_{t−1} + (1 − γ) ‖g_{t−1}‖ / (‖g_t − g_{t−1}‖ + c_t)
4:   A_t = α_t S_t
5:   w_{t+1} = w_t − A_t g_t
6: end for

The AdaLip methodology can be applied on top of existing optimizers. One such instance is AdaLip + RMSProp, which will be referred to as RMSLip and is described in Algorithm 3. Again, the optimal learning rate is estimated and applied for each layer separately. The difference here is that each parameter is also scaled by the moving average of the squares of its past gradients, as dictated by the RMSProp update rule.
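A self-contained sketch of the per-layer AdaLip loop as we read Algorithm 2 (the quadratic toy loss and all names are ours, not from the paper):

```python
import numpy as np

def adalip_train(layers, grad_fn, alpha0=0.1, gamma=0.8, c=1e-8, steps=50):
    S = [1.0] * len(layers)                        # S_0 = 1 per layer
    g_prev = [np.zeros_like(w) for w in layers]    # g_0 = 0
    for t in range(1, steps + 1):
        grads = grad_fn(layers)
        for i, g in enumerate(grads):
            # Norms use only the gradients of the current layer (Eq. 8).
            ratio = np.linalg.norm(g_prev[i]) / (np.linalg.norm(g - g_prev[i]) + c)
            S[i] = gamma * S[i] + (1 - gamma) * ratio
            layers[i] = layers[i] - alpha0 * S[i] * g   # w_{t+1} = w_t - A_t g_t
            g_prev[i] = g
    return layers

# Toy usage: two "layers", each with loss 0.5 * ||w||^2 (gradient = w).
out = adalip_train([np.array([4.0, -2.0]), np.array([1.0, 3.0])],
                   grad_fn=lambda ls: [w.copy() for w in ls])
```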

Algorithm 3 RMSLip
Input: w_1 ∈ F, initial step α_0, γ ∈ (0, 1), c_t, decay β, ε
Initialize: S_0 = 1, g_0 = 0, ν_0 = 0
1: for t = 1 to T do
2:   g_t = ∇f_t(w_t)
3:   S_t = γ S_{t−1} + (1 − γ) ‖g_{t−1}‖ / (‖g_t − g_{t−1}‖ + c_t)
4:   A_t = α_t S_t
5:   ν_t = β ν_{t−1} + (1 − β) g_t²
6:   w_{t+1} = w_t − A_t g_t / (√ν_t + ε)
7: end for

Finally, we examine the combination of AdaLip and Adam, which will be referred to as AdamLip. The update rule of Adam is left unchanged (i.e. the moving averages of the gradients m_t and of their squares ν_t, and their bias correction); however, the constant learning rate is substituted with A_t (Eq. 9).

Implementation Details
This section explains how the hyperparameter values of AdaLip were chosen and initialized.
From Theorem 1, the optimal c_t has been calculated in order to have a guarantee of convergence. In practice, the experiments were run with a constant c_t = c = 10^−8, which is close to the theoretical c_t. This is discussed in more detail in Sect. 6.2.
For the initial learning rate α_0 we selected many values in order to test its robustness; this is further analyzed in Sect. 5. The initial value of the moving average, S_0, is set to one, so that the overall learning rate starts at the initial learning rate α_0.
Another important hyperparameter to tune is the coefficient γ of the moving average. Its range is (0, 1) and the proposed value is 0.8, which was found empirically. Lower values cause the learning rate to change drastically from iteration to iteration. This could be beneficial for overcoming bad local minima in which the network can get stuck, but it also makes the algorithm unstable. On the other hand, higher values smooth out the overall learning rate, suppressing any possible spikes.

Experimental Framework
In order to accurately capture the performance of this novel optimization technique, three benchmark datasets were used: MNIST [31], CIFAR10 [32] and CIFAR100 [32]. AdaLip is compared with its counterparts (i.e. AdaLip versus SGD, RMSLip versus RMSProp and AdamLip versus Adam) in terms of convergence speed and robustness to the selection of the initial learning rate α_0. The goal is to determine whether or not this addition improves the training procedure of a Neural Network.
Because of the varying levels of difficulty of each task, a different architecture was used for each. These can be seen in Tables 6, 7 and 8 in Appendix B. The first two are small versions of the VGG [33] architecture, while the CIFAR100 one is a lot deeper and utilizes BatchNormalization [28] layers as well as Dropout [34].
For result stability, every setup for every optimizer was run multiple times (25 runs for MNIST and CIFAR10, 10 runs for CIFAR100). 25 runs are enough to capture the overall variance of the different initializations of the networks for MNIST and CIFAR10. For CIFAR100 we selected fewer runs because, due to the larger number of training epochs compared to the other two datasets, the variance between runs was observed to be quite small.
In every run, the optimizers were initialized with the same weights and hyperparameters. Because different optimizers perform best with different values of the base learning rate, various values were tested for each optimizer and dataset.
To evaluate the results, we first found the peak training accuracy of each run and then took the median of those values over all runs, for each learning rate independently. This was done because we consider the most successful optimizer to be the one that outperforms the others consistently, independently of random initialization. The next evaluation step is to assess how sensitive each optimizer is to the selection of the base learning rate. For this reason we took the max, mean, median and std over all the different learning-rate experiments; these are displayed in Tables 2, 3 and 4 for each dataset and discussed below. Another important feature of an optimizer is its speed of convergence, which can be measured in many ways. In this study the metric was the number of epochs needed to reach a fixed fraction of the maximum training peak (98% for MNIST and 95% for both CIFAR datasets), where the maximum is computed over all runs and learning-rate experiments. Similarly to the accuracy measurement, the mean, median and best (lowest) epoch of convergence were computed to capture a more complete picture of the training process.
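As a concrete reading of the convergence-speed metric described above (the helper name is ours):

```python
import numpy as np

def epochs_to_converge(acc_curve, global_peak, frac=0.95):
    # First (1-indexed) epoch whose accuracy reaches frac * global_peak,
    # where global_peak is the maximum over all runs and learning rates.
    target = frac * global_peak
    hits = np.nonzero(np.asarray(acc_curve) >= target)[0]
    return int(hits[0]) + 1 if hits.size else None
```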

MNIST
First, AdaLip was tested on MNIST, a dataset of small grayscale images of handwritten digits, of size 28 × 28. It consists of 60,000 training images and 10,000 test images. The images were normalized to the range [0, 1] and no further preprocessing or augmentation was applied.

Fig. 2 Mean curves over all learning rates on MNIST for Adam- and SGD-based optimizers
AdaLip in conjunction with every optimizer seems to perform better in both max and mean performance. It also has a lower std, meaning that the results are more stable and fluctuate less with the change of learning rate. Regarding the epochs of convergence, we see a slight improvement in the mean and the median. Considering that MNIST is an easy dataset to train on and converges extremely fast, there is little room for big improvements. For a closer look, the mean curve over all learning rates for SGD- and Adam-based optimizers is displayed in Fig. 2, where the shading represents the variance across the different learning-rate experiments. The AdaLip versions seem to converge faster and reach higher accuracies.

CIFAR10
CIFAR10 was the second dataset used to test our methodology. CIFAR10 consists of colored images of size 32 × 32. The training set has 50,000 images and the test set 10,000, divided into 10 classes. The images were normalized to [0, 1] as in MNIST and no further preprocessing was added. In this task, the SGD-based optimizers were examined for 10 learning rates (i.e. 0.005, 0.01, 0.05, 0.07, 0.1, 0.2, 0.3, 0.5, 0.7, 1.0), the Adam-based ones for 7 learning rates (i.e. 0.0002, 0.0003, 0.0005, 0.0007, 0.001, 0.005, 0.01) and the RMSProp-based ones for 5 (i.e. 0.0003, 0.0005, 0.0007, 0.001, 0.005). The same procedure was followed here, which yielded results similar to MNIST. AdamLip and RMSLip outperformed their counterparts on mean and max scores. AdaLip scored 0.5% lower in max performance, but its mean and median results were quite an improvement compared to SGD. Regarding the std scores, RMSLip shows great stability while the rest display similar behavior. As for the epochs-of-convergence analysis, Adam and SGD are slightly ahead (by 1 epoch or less on average), but RMSLip is faster than its counterpart. The mean curve over all learning rates for SGD- and Adam-based optimizers is plotted in Fig. 3. It is clear that the AdaLip versions converge faster and to higher accuracy.

CIFAR100
The last dataset used for testing the proposed framework was CIFAR100. It has the same images as CIFAR10 but, instead of 10 classes, the dataset is divided into 100. This makes it a lot more difficult to train a neural network on and to achieve a good generalization score. To help with generalization, image augmentation was performed after the normalization; the augmentation techniques were random rotations, flips and width/height shifts.

Generalization Performance
The focus of this paper mainly revolves around the training performance of the optimizers, but the generalization performance is extremely important as well. In order to measure the test scores of each optimizer, we computed the mean of the maximum test accuracy of each run of the aforementioned experimental process; that is, for each optimizer and for each learning rate, a mean test accuracy is extracted for the three benchmark datasets. Table 5 displays the best test accuracies of the optimizers among all the starting learning rates for each case. We can see that in most cases the scores are close, but the AdaLip versions display an improvement. Specifically, AdamLip's performance is slightly better than Adam's in all 3 datasets. RMSProp seems to achieve better generalization than RMSLip on CIFAR10, but on the other datasets the difference is smaller. Lastly, regarding SGD and AdaLip, AdaLip seems to perform better in all 3 datasets. Overall, the test scores show a slight advantage for the AdaLip-based optimizers.

Discussion and Future Work
In this section we discuss various aspects of the proposed algorithm and the theoretical analysis, as well as some insights on the training process of DNNs.

Learning Rates Per Layer
As previously mentioned in Sect. 3.4, the norms of the weights of the layers follow a U-like shape; specifically, the first and last layers tend to exhibit larger norms. In Figs. 5, 6 and 7 we can see that in most situations the algorithm chose a different learning rate for different layers, which seems to validate our initial intuition. It is important to note that the AdaLip and AdamLip optimizers show a wider variance over time compared to RMSLip. Another observation is that, although the norms of the weights have a fixed tendency, the learning rates per layer do not. This means that the learning rate successfully adapts to the needs of each architecture and dataset individually.

Theoretical versus Practical c t
One of the main discussion points is the selection of c_t and how it balances the two most important features of an optimizer: the theoretical guarantee of convergence and the overall performance. As mentioned in the Related Work section, one of the most used schedules with a convergence guarantee is dividing the initial learning rate by the square root of the number of iterations (see Eq. 3). In practice, however, this is not preferred: the decay applied by the square root is too aggressive for any real application that needs thousands of iterations to reach good performance. With that many iterations the learning rate becomes extremely small at an early stage of training, resulting in convergence to a bad local minimum. On the other hand, a steadier learning rate seems to provide better results. This can be seen in Fig. 8, where the constant learning rate achieves better convergence than the theoretical one.
Recent studies [8,9,12] show that an oscillating learning rate performs in some cases better than a strictly decreasing one, because it helps the optimizer overcome local minima.
Similarly, in the case of c_t the theoretical value yields the same performance as a constant value, except for some cases where the constant c_t has a slight edge over the theoretical one. This can be seen in Fig. 9, where all runs with the same learning rate are very close, but the constant c_t shows a small improvement. This trade-off between the guarantee of convergence and better results is quite important and can be rephrased as a trade-off between stability and peak performance. One of the reasons this behavior exists lies in the nature of SGD. In order to prove that the regret of SGD (or its variants) converges to 0 over time, various assumptions are made about the loss landscape. In practice, though, the landscape may not be ideal, and converging to the first local minimum might not be satisfactory [35,36]. Naturally, escaping saddle points and local minima has been a focal point of optimization research [11,37,38].

SGD Based Equation for Optimal LR
In order to find the optimal learning rate we used Lemma 1, which is based on Gradient Descent. However, we have to mention why the optimal learning rate based on SGD was not used. Various experiments showed that the equation derived from SGD leads to quite unstable learning. Even when the network trained normally, the convergence speed was notably slower, to the point that no amount of fine-tuning could have made it competitive. Lemma 6 in the Appendix gives the full formula for the optimal learning rate based on SGD. It would be very interesting for future work to see whether there is a way to apply it efficiently.

Fig. 10 Magnitude of gradients and updates on CIFAR100 for the BatchNormalization layers (gamma and beta weights). The blue lines depict the first iterations; as the color changes to red, it shows iterations closer to the end.

Impact of Batch Normalization
Another important observation about the norms of the layers arises with the use of Batch Normalization. As can be seen from Fig. 10, the magnitude of the BatchNormalization weights does not always follow the pattern shown earlier in Fig. 1. This means that the gamma and beta weights of the BatchNormalization layers display a different behavior compared with the rest of the weights. Although Batch Normalization helps smooth the gradients of the layers during training [28], it creates weights that suffer from the same problem of differing per-layer behavior. This is why AdaLip is able to construct a suitable learning rate for these parameters as well. This is supported by the fact that, in practice, AdaLip seems to perform better than other optimizers on networks that apply Batch Normalization (Sec. 5.3).

7 Conclusion
To summarize, this paper proposes a novel optimization method, called AdaLip, which uses the theoretical optimal learning rate based on the Lipschitz constant to produce an adaptive learning rate per layer. The motivation behind the idea was an analysis of the different behaviors of the different layers of a network. We show that this method improves the convergence speed of the overall training process of neural networks, supported by a number of experiments on three benchmark datasets. An additional advantage of the proposed technique is that it can work together with various existing optimizers, such as Adam or RMSProp, while being more robust to the selection of the starting learning rate.
In order to guarantee the decrease of the function f during the iterations, the following must be true: To maximize the decrease of the function f per iteration, the derivative of f with respect to α must be zero:

Proof of Lemma 2 From Eq. 8, S t can be written as: For the upper bound we take: The sum on the right is the sum of a geometric series and, because γ ≠ 1, it can be transformed to: For the lower bound we use: Computing the same geometric series we get: which concludes the proof of the upper and lower bounds of S t .
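For the reader's convenience, the standard descent-lemma computation behind the Lemma 1 argument runs as follows (our reconstruction in generic notation, not necessarily the paper's exact displays):

```latex
% Descent lemma for an L-smooth f with step w_{t+1} = w_t - \alpha \nabla f(w_t):
f(w_{t+1}) \le f(w_t) - \alpha \|\nabla f(w_t)\|^2
              + \frac{L \alpha^2}{2} \|\nabla f(w_t)\|^2
            = f(w_t) - \alpha \Bigl(1 - \frac{L \alpha}{2}\Bigr) \|\nabla f(w_t)\|^2 .
% Decrease is guaranteed whenever the bracket is positive, i.e. 0 < \alpha < 2/L.
% Maximizing the per-iteration decrease:
\frac{\partial}{\partial \alpha}\Bigl[-\alpha + \frac{L \alpha^2}{2}\Bigr] = 0
  \;\Longrightarrow\; \alpha^{*} = \frac{1}{L} .
```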

Proof of Theorem 1
From the projected stochastic subgradient method w t+1 = Π D (w t − A t g t ), we have: From Lemma 3 we have: Summing from t = 1, 2, ..., T to form the Regret, we have: Unfolding the sums we get: Because the gradients are bounded, g t ≤ G, and using Lemma 4 we have: From Lemma 5, with an appropriate c t the difference is positive, so we can use Lemma 4 to get: Now the sum becomes telescopic: Clearly the behavior of A t affects the convergence of the whole algorithm. The left term shows that if A t decreases very fast (i.e., O(1/t 2 )) the algorithm diverges. On the other hand, if A t is constant or increases, then the right term (the sum) will diverge. Lemma 5 shows that c t controls how fast A t decreases. There can be many c t that lead to convergence with various speeds. Here one of them is presented that satisfies the following equation: Using the bounds of S t from Lemma 2 we have: From Eq. A.3 it is clear that c t is non-decreasing; thus, the above inequality transforms into:

Proof From the triangle inequality we have:

Lemma 6 Given a convex loss function f , with an L-Lipschitz continuous gradient (Def. 1), the optimal learning rate for Stochastic Gradient Descent is:

Proof The update rule for SGD is the following: where f t is the partial f computed by a mini-batch at iteration t. Starting with the same equations as Lemma 1 we have: Taking the expected value of both sides: Using the fact that E[∇ f t (w t )] = ∇ f (w t ): Taking the derivative equal to zero, similarly to Lemma 1:

We also constructed variants of the optimizers based on SGD, RMSProp and Adam, called AdaLip-U, RMSLip-U and AdamLip-U respectively. As shown in Fig. 11, AdamLip-U shows much potential in some cases; however, it was much more unstable than its counterpart and more sensitive to the selection of the learning rate. In Algorithm 5, AdaLip-U is presented in detail.
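The expectation step in the Lemma 6 proof can be rendered as follows (our reconstruction in generic notation, not necessarily the paper's exact displays):

```latex
% SGD step w_{t+1} = w_t - \alpha \nabla f_t(w_t); L-smoothness of f gives
\mathbb{E}\bigl[f(w_{t+1})\bigr]
  \le f(w_t) - \alpha \, \nabla f(w_t)^{\top} \mathbb{E}\bigl[\nabla f_t(w_t)\bigr]
      + \frac{L \alpha^2}{2} \, \mathbb{E}\bigl[\|\nabla f_t(w_t)\|^2\bigr]
   =  f(w_t) - \alpha \, \|\nabla f(w_t)\|^2
      + \frac{L \alpha^2}{2} \, \mathbb{E}\bigl[\|\nabla f_t(w_t)\|^2\bigr],
% using \mathbb{E}[\nabla f_t(w_t)] = \nabla f(w_t). Setting the derivative
% with respect to \alpha to zero yields
\alpha^{*} = \frac{\|\nabla f(w_t)\|^2}{L \, \mathbb{E}\bigl[\|\nabla f_t(w_t)\|^2\bigr]} .
```

Since the second moment E[||∇f t (w t )||²] must itself be estimated from noisy mini-batches, this α* fluctuates from step to step, which may explain the instability of the SGD-based step reported above.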

Fig. 3 Mean Curves out of all learning rates on CIFAR10 for Adam and SGD based optimizers

Fig.4 Mean Curves out of all learning rates on CIFAR100 for Adam and SGD based optimizers

Fig. 8 Constant Learning rate vs Square root decay (with various learning rates) with Adam on CIFAR10

Lemma 3 If a function f : R^d → R is convex, then for all x, y ∈ R^d, f(y) ≥ f(x) + ∇f(x)^T (y − x).

Lemma 4 Given weights w_i ∈ R^d with ||w_i|| ≤ r, then ||w_n − w_m||^2 ≤ 4r^2, ∀ n, m.
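The diameter bound of Lemma 4 is easy to check numerically; a small sketch (the projection helper and the sampling are ours, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
r = 2.0

def project_to_ball(w, r):
    """Rescale w so that ||w|| <= r, enforcing the boundedness assumption of Lemma 4."""
    n = np.linalg.norm(w)
    return w if n <= r else w * (r / n)

# Draw random weight vectors, clip them into the radius-r ball, and verify
# the pairwise diameter bound ||w_n - w_m||^2 <= 4 r^2.
ws = [project_to_ball(rng.normal(size=8) * 5.0, r) for _ in range(20)]
max_sq_dist = max(float(np.linalg.norm(a - b) ** 2) for a in ws for b in ws)
```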

Table 1
Differences in popular adaptive optimization methods and how they are structured based on the optimization framework of Algorithm 1

Table 2
Training set accuracy (top 4 rows) and Epochs of convergence (bottom 3 rows) for the MNIST dataset

Table 3
Training set accuracy (top 4 rows) and Epochs of convergence (bottom 3 rows) for the CIFAR10 dataset

Table 4
Training set accuracy (top 4 rows) and Epochs of convergence (bottom 3 rows) for the CIFAR100 dataset

Table 5 Test
||w_n − w_m||^2 = ||w_n||^2 − 2 w_n^T w_m + ||w_m||^2 ≤ ||w_n||^2 + 2 ||w_n|| ||w_m|| + ||w_m||^2 ≤ r^2 + 2r·r + r^2 = 4r^2

Lemma 5 Let A_t be the learning rate from Eq. 9, with α_t = α_0/√t. Then A_t ≤ A_{t−1} holds if c_t is a positive function of t that satisfies the following inequality: c_t ≥ (1 − γ) ||g_{t−1}|| √(t/(t−1)) − γ S_{t−1} − ||g_t − g_{t−1}||