
Resolving learning rates adaptively by locating stochastic non-negative associated gradient projection points using line searches

Published in: Journal of Global Optimization

Abstract

Learning rates in stochastic neural network training are currently determined prior to training, using expensive manual or automated iterative tuning. Attempts to resolve learning rates adaptively using line searches have proven computationally demanding. Reducing the computational cost by considering mini-batch sub-sampling (MBSS) introduces challenges due to significant variance in information between batches, which may present as discontinuities in the loss function, depending on the MBSS approach. This study proposes a robust approach to adaptively resolving learning rates in dynamic MBSS loss functions. This is achieved by finding sign changes from negative to positive in directional derivatives along the search direction, which ultimately converge to a stochastic non-negative associated gradient projection point (SNN-GPP). Through a number of investigative studies, we demonstrate that gradient-only line searches (GOLS) resolve learning rates adaptively, improving convergence performance over minimization line searches, ignoring certain local minima and eliminating an otherwise expensive hyperparameter. We also show that poor search directions may benefit computationally from overstepping optima along a descent direction, which can be resolved by considering improved search directions. Demonstrating that GOLS are reliable line searches in turn enables comparative investigations between static and dynamic MBSS.




References

  1. Anguita, D., Ghio, A., Oneto, L., Parra, X., Reyes-Ortiz, J.L.: Human activity recognition on smartphones using a multiclass hardware-friendly support vector machine. In: Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 7657 LNCS, pp. 216–223 (2012). https://doi.org/10.1007/978-3-642-35395-6_30

  2. Anitescu, M.: Degenerate nonlinear programming with a quadratic growth condition. SIAM J. Optim. 10(4), 1116–1135 (2000). https://doi.org/10.1137/S1052623499359178


  3. Arora, J.: Introduction to Optimum Design, 3rd edn. Academic Press Inc, Cambridge (2011)


  4. Balles, L., Hennig, P.: Dissecting Adam: the sign, magnitude and variance of stochastic gradients, vol. 1, pp. 1–17 (2018). arXiv:1705.07774v2 [cs.LG]

  5. Bergstra, J., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyper-parameter optimization. In: NIPS 2011, pp. 2546–2554 (2011). arXiv:1206.2944S

  6. Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13(February), 281–305 (2012). https://doi.org/10.1162/153244303322533223


  7. Bertsekas, D.P.: Convex Optimization Algorithms, 1st edn. Athena Scientific, Belmont (2015)


  8. Bishop, C.M.: Pattern Recognition and Machine Learning, 1st edn. Springer, Berlin (2006)


  9. Bollapragada, R., Byrd, R., Nocedal, J.: Adaptive sampling strategies for stochastic optimization, pp. 1–32 (2017). arXiv:1710.11258

  10. Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: COMPSTAT 2010, Keynote, Invited and Contributed Papers, vol. 19, pp. 177–186 (2010). https://doi.org/10.1007/978-3-7908-2604-3-16

  11. Byrd, R.H., Chin, G.M., Nocedal, J., Wu, Y.: Sample size selection in optimization methods for machine learning. Math. Program. 134(1), 127–155 (2012). https://doi.org/10.1007/s10107-012-0572-5


  12. Chen, T., Sun, Y., Shi, Y., Hong, L.: On sampling strategies for neural network-based collaborative filtering, pp. 1–14 (2017). arXiv:1706.07881 [cs.LG]

  13. Choromanska, A., Henaff, M., Mathieu, M., Arous, G.B., LeCun, Y.: The loss surfaces of multilayer networks. In: AISTATS 2015, vol. 38, pp. 192–204 (2015)

  14. Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., Bengio, Y.: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. ICLR 2014, 1–9 (2014)


  15. Davis, C.: The norm of the Schur product operation. Numer. Math. 4(1), 343–344 (1962). https://doi.org/10.1007/BF01386329


  16. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12(July), 2121–2159 (2011)


  17. Engelbrecht, A.P.: Fundamentals of Computational Swarm Intelligence, 1st edn. Wiley, Hoboken (2005)


  18. Fisher, R.A.: The use of multiple measurements in taxonomic problems. Ann. Eugen. 7(2), 179–188 (1936). https://doi.org/10.1111/j.1469-1809.1936.tb02137.x


  19. Floudas, C.A., Pardalos, P.M.: Encyclopedia of Optimization, 2nd edn. Springer, Berlin (2009)


  20. Friedlander, M.P., Schmidt, M.: Hybrid deterministic-stochastic methods for data fitting, pp. 1–26 (2011). https://doi.org/10.1137/110830629. arXiv:1104.2373 [cs.LG]

  21. Gong, P., Ye, J.: Linear convergence of variance-reduced stochastic gradient without strong convexity (2014). arXiv:1406.1102

  22. Goodfellow, I.J., Vinyals, O., Saxe, A.M.: Qualitatively characterizing neural network optimization problems. ICLR 2015, 1–11 (2015)


  23. Jaderberg, M., Dalibard, V., Osindero, S., Czarnecki, W.M., Donahue, J., Razavi, A., Vinyals, O., Green, T., Dunning, I., Simonyan, K., Fernando, C., Kavukcuoglu, K.: Population based training of neural networks, pp. 1–13 (2017). arXiv:1711.09846

  24. Johnson, B., Tateishi, R., Xie, Z.: Using geographically weighted variables for image classification. Remote Sens. Lett. 3(6), 491–499 (2012). https://doi.org/10.1080/01431161.2011.629637


  25. Johnson, B.A., Tateishi, R., Hoan, N.T.: A hybrid pansharpening approach and multiscale object-based image analysis for mapping diseased pine and oak trees. Int. J. Remote Sens. 34(20), 6969–6982 (2013). https://doi.org/10.1080/01431161.2013.810825


  26. Karimi, H., Nutini, J., Schmidt, M.: Linear convergence of gradient and proximal-gradient methods under the Polyak–Łojasiewicz Condition. In: ECML PKDD: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, vol. 9851, pp. 795–811. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46128-1_50

  27. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. ICLR 2015, 1–15 (2015). https://doi.org/10.1145/1830483.1830503


  28. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes, pp. 1–14 (2013). https://doi.org/10.1051/0004-6361/201527329. arXiv:1312.6114v10

  29. Krizhevsky, A., Hinton, G.E.: Learning Multiple Layers of Features from Tiny Images. University of Toronto, Toronto (2009)


  30. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998). https://doi.org/10.1109/5.726791


  31. Li, M., Zhang, T., Chen, Y., Smola, A.J.: Efficient mini-batch training for stochastic optimization. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1–10 (2014). https://doi.org/10.1145/2623330.2623612

  32. Liu, J., Wright, S.J., Ré, C., Bittorf, V., Sridhar, S.: An asynchronous parallel stochastic coordinate descent algorithm. J. Mach. Learn. Res. 16, 285–322 (2015)


  33. Lucas, D.D., Klein, R., Tannahill, J., Ivanova, D., Brandon, S., Domyancic, D., Zhang, Y.: Failure analysis of parameter-induced simulation crashes in climate models. Geosci. Model Dev. 6(4), 1157–1171 (2013). https://doi.org/10.5194/gmd-6-1157-2013


  34. Luo, Z.Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: a general approach. Ann. Oper. Res. 46–47(1), 157–178 (1993). https://doi.org/10.1007/BF02096261


  35. Mahsereci, M., Hennig, P.: Probabilistic line searches for stochastic optimization. J. Mach. Learn. Res. 18, 1–59 (2017)


  36. Mansouri, K., Ringsted, T., Ballabio, D., Todeschini, R., Consonni, V.: Quantitative structure–activity relationship models for ready biodegradability of chemicals. J. Chem. Inf. Model. 53(4), 867–878 (2013). https://doi.org/10.1021/ci4000213


  37. Martens, J.: Deep learning via Hessian-free optimization. ICML 2010, 1–6 (2010). https://doi.org/10.1155/2011/176802


  38. Marwala, T.: Bayesian training of neural networks using genetic programming. Pattern Recogn. Lett. 28(12), 1452–1458 (2007). https://doi.org/10.1016/J.PATREC.2007.03.004


  39. Montana, D.J., Davis, L.: Training feedforward neural networks using genetic algorithms (1989)

  40. Nash, W.J., Sellers, T.L., Talbot, S.R., Cawthorn, A.J., Ford, W.B.: The Population Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and Islands of Bass Strait. Technical report, Sea Fisheries Division (1994)

  41. Nesterov, Y.: Primal–dual subgradient methods for convex problems. Math. Program. Ser. B 120, 221–259 (2009). https://doi.org/10.1007/s10107-007-0149-x


  42. Paschke, F., Bayer, C., Bator, M., Mönks, U., Dicks, A., Enge-Rosenblatt, O., Lohweg, V.: Sensorlose Zustandsüberwachung an Synchronmotoren. In: Conference: 23. Workshop Computational Intelligence (VDI/VDE-Gesellschaft Mess- und Automatisierungstechnik (GMA)). Dortmund (2013)

  43. Prechelt, L.: PROBEN1—a set of neural network benchmark problems and benchmarking rules (Technical Report 21-94). Technical report, Universität Karlsruhe (1994)

  44. pytorch.org: PyTorch. https://pytorch.org/ (2019). Version: 1.0

  45. Radiuk, P.M.: Impact of training set batch size on the performance of convolutional neural networks for diverse datasets. Inf. Technol. Manag. Sci. 20(1), 20–24 (2017). https://doi.org/10.1515/itms-2017-0003


  46. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951). https://doi.org/10.1214/aoms/1177729586


  47. Ruder, S.: An overview of gradient descent optimization algorithms, pp. 1–14 (2016). https://doi.org/10.1111/j.0006-341X.1999.00591.x. arXiv:1609.04747v2 [cs.LG]

  48. Saxe, A.M., McClelland, J.L., Ganguli, S.: Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, pp. 1–22 (2013). CoRR arXiv:1312.6120

  49. Shor, N.Z.: Minimization Methods for Non-Differentiable Functions, 1st edn. Springer, Berlin (1985)


  50. Shor, N.Z.: The subgradient method. In: Minimization Methods for Non-Differentiable Functions, pp. 22–47. Springer, Berlin (1985)

  51. Snoek, J., Larochelle, H., Adams, R.: Practical Bayesian optimization of machine learning algorithms. In: NIPS, pp. 1–9 (2012). arXiv:1206.2944S

  52. Snyman, J.A., Wilke, D.N.: Practical Mathematical Optimization. Springer Optimization and Its Applications, vol. 133. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-77586-9


  53. Tong, F., Liu, X.: Samples selection for artificial neural network training in preliminary structural design. Tsinghua Sci. Technol. 10(2), 233–239 (2005). https://doi.org/10.1016/S1007-0214(05)70060-2


  54. Vurkaç, M.: Clave-direction analysis: a new arena for educational and creative applications of music technology. J. Music Technol. Educ. 4(1), 27–46 (2011). https://doi.org/10.1386/jmte.4.1.27_1


  55. Werbos, P.J.: The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting. Wiley, New York, NY (1994)


  56. Wilke, D.N., Kok, S., Snyman, J.A., Groenwold, A.A.: Gradient-only approaches to avoid spurious local minima in unconstrained optimization. Optim. Eng. 14(2), 275–304 (2013). https://doi.org/10.1007/s11081-011-9178-7


  57. Yeh, I.C., Lien, C.H.: The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Syst. Appl. 36(2), 2473–2480 (2009). https://doi.org/10.1016/J.ESWA.2007.12.020


  58. Zhang, C., Öztireli, C., Mandt, S., Salvi, G.: Active mini-batch sampling using repulsive point processes (2018). arXiv:1804.02772

  59. Zhang, H., Yin, W.: Gradient methods for convex minimization: better rates under weaker conditions. ArXiv e-prints (2013)

  60. Ziȩba, M., Tomczak, S.K., Tomczak, J.M.: Ensemble boosted trees with synthetic features generation in application to bankruptcy prediction. Expert Syst. Appl. 58, 93–101 (2016). https://doi.org/10.1016/J.ESWA.2016.04.001


  61. Zuo, X., Chintala, S.: Basic VAE example. https://github.com/pytorch/examples/tree/master/vae (2018). Accessed on 7 May 2018

Download references

Acknowledgements

This work was supported by the Centre for Asset and Integrity Management (C-AIM), Department of Mechanical and Aeronautical Engineering, University of Pretoria, Pretoria, South Africa. We would also like to thank NVIDIA for sponsoring the Titan X Pascal GPU used in this study.

Author information

Corresponding author

Correspondence to Dominic Kafka.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Artificial neural networks

The single and double hidden layer feedforward neural network architectures are expressed mathematically by Eqs. (23) and (24) respectively, below. The optimization vector, \({\textit{\textbf{x}}}\), is sectioned and transformed into matrices \({\textit{\textbf{X}}}^{(c)}\) containing the relevant weights of connection layer c of the network. A given data observation pair \({\textit{\textbf{t}}}_b\) is separated into the input data, \({\textit{\textbf{T}}}^i_b\), and the output data, \({\textit{\textbf{T}}}^o_b\). Suppose a given dataset has an input domain, \({\textit{\textbf{T}}}^i\), with \(|{\mathcal {B}}|\) observations and D dimensions (features). The respective output domain, \({\textit{\textbf{T}}}^o\), has the same number of observations, \(|{\mathcal {B}}|\), and E output dimensions (classes). Then, for every observation b and every output dimension e, a prediction \(\hat{{\textit{\textbf{T}}}}^o\) of the output data can be constructed from the original input domain \({\textit{\textbf{T}}}^i\), given by

$$\begin{aligned} \hat{{\textit{\textbf{T}}}}_{be}^o = a_{outer} \left( \sum _{j=1}^{M_1} {\textit{\textbf{X}}}_{ej}^{(2)} a_{inner} \left( \sum _{i=1}^{D} {\textit{\textbf{X}}}_{ji}^{(1)} {\textit{\textbf{T}}}_{bi}^i + {\textit{\textbf{X}}}_{j0}^{(1)}\right) + {\textit{\textbf{X}}}_{e0}^{(2)}\right) , \end{aligned}$$
(23)

for a single hidden layer neural network and

$$\begin{aligned} \hat{{\textit{\textbf{T}}}}_{be}^o = a_{outer} \left( \sum _{l=1}^{M_2} {\textit{\textbf{X}}}_{el}^{(3)} a_{inner}^{(2)} \left( \sum _{j=1}^{M_1} {\textit{\textbf{X}}}_{lj}^{(2)} a_{inner}^{(1)} \left( \sum _{i=1}^{D} {\textit{\textbf{X}}}_{ji}^{(1)} {\textit{\textbf{T}}}_{bi}^i + {\textit{\textbf{X}}}_{j0}^{(1)} \right) + {\textit{\textbf{X}}}_{l0}^{(2)} \right) + {\textit{\textbf{X}}}_{e0}^{(3)} \right) , \end{aligned}$$
(24)

for a double hidden layer neural network.

\(M_{n}\), \(n \in \{1,2\}\), gives the number of nodes in the respective hidden layers. The nodal activation function is denoted by a, and \({\textit{\textbf{X}}}^{(c)}\), \(c \in \{1,2,3\}\), denotes the set of weights connecting sequential layers of the network, from the input layer to the output layer in the forward direction. Thus the single hidden layer network has two sets of weights, \({\textit{\textbf{X}}}^{(c)}\), and the double hidden layer network has three, respectively [8].
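To make Eq. (23) concrete, a minimal NumPy sketch of the single hidden layer forward pass is given below. The function names, the sigmoid activations and the explicit weight and bias arguments (W1, b1, W2, b2, standing in for the weight sets \({\textit{\textbf{X}}}^{(1)}\) and \({\textit{\textbf{X}}}^{(2)}\) with their bias columns separated out) are illustrative assumptions; the numerical studies themselves were implemented in PyTorch [44].

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward_single_hidden(T_in, W1, b1, W2, b2,
                              a_inner=sigmoid, a_outer=sigmoid):
        """Single hidden layer forward pass of Eq. (23).

        T_in : (|B|, D) input observations T^i
        W1   : (M1, D) weights X^(1);  b1 : (M1,) biases X^(1)_{j0}
        W2   : (E, M1) weights X^(2);  b2 : (E,)  biases X^(2)_{e0}
        Returns the (|B|, E) predictions T^o_hat.
        """
        H = a_inner(T_in @ W1.T + b1)   # hidden layer activations, shape (|B|, M1)
        return a_outer(H @ W2.T + b2)   # output predictions, shape (|B|, E)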

The nodal weights \({\textit{\textbf{x}}}\) are optimized to a configuration that best captures the relationship between the input and output data spaces. The loss function used is the mean squared error (MSE), determined over every observation b in a batch of size \(|{\textit{\textbf{B}}}|\) and every class \(e \in \{1,\dots ,E\}\), according to the Proben1 dataset guidelines [43], as:

$$\begin{aligned} \ell ({\textit{\textbf{x}}},{\textit{\textbf{t}}}_b) = \frac{100}{E\cdot |{\textit{\textbf{B}}}|} \sum _{b = 1 }^{|{\textit{\textbf{B}}}|} \sum _{e=1}^{E}(\hat{{\textit{\textbf{T}}}}_{be}^o({\textit{\textbf{x}}}) - {\textit{\textbf{T}}}_{be}^o)^2, \end{aligned}$$
(25)

where \(\hat{{\textit{\textbf{T}}}}^o({\textit{\textbf{x}}})\) is the output estimation of the current network configuration as a function of the weights, and \({\textit{\textbf{T}}}^o\) is the target output of the corresponding training dataset samples.
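A corresponding sketch of the Proben1-style MSE in Eq. (25), again in plain NumPy purely for illustration:

    def mse_loss(T_hat, T_out):
        """Eq. (25): squared error summed over the batch and the E classes,
        scaled by 100 / (E * |B|)."""
        B, E = T_out.shape
        return 100.0 / (E * B) * np.sum((T_hat - T_out) ** 2)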

B Exact line search: gradient-only line search with bisection (GOLS-B)

The directional derivative values used in this method are defined as \(F'_n(\alpha ) = {{\textit{\textbf{g}}}}({\textit{\textbf{x}}}_n + \alpha \cdot {\textit{\textbf{d}}}_n)^T{\textit{\textbf{d}}}_n \), evaluated along the search direction, \({\textit{\textbf{d}}}_n\), at the respective values of \(\alpha \) considered at the different points.

[Algorithm listing for GOLS-B, rendered as a figure in the published article.]
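As a rough illustration only, the sketch below bisects on the sign of the directional derivative \(F'_n(\alpha )\), assuming a bracket \([\alpha _l, \alpha _u]\) with \(F'_n(\alpha _l) < 0\) and \(F'_n(\alpha _u) > 0\) has already been identified. The helper grad_fn, the bracket arguments and the tolerance are assumptions made for this sketch, not details of the published GOLS-B listing.

    def directional_derivative(grad_fn, x, d, alpha):
        """F'_n(alpha) = g(x_n + alpha * d_n)^T d_n for a (mini-batch) gradient g."""
        return grad_fn(x + alpha * d) @ d

    def gols_bisection(grad_fn, x, d, alpha_l, alpha_u, tol=1e-8, max_iter=50):
        """Bisect the bracket until it is smaller than tol; return its midpoint."""
        for _ in range(max_iter):
            alpha_m = 0.5 * (alpha_l + alpha_u)
            if directional_derivative(grad_fn, x, d, alpha_m) < 0.0:
                alpha_l = alpha_m   # still descending: sign change lies to the right
            else:
                alpha_u = alpha_m   # non-negative: sign change lies to the left
            if alpha_u - alpha_l < tol:
                break
        return 0.5 * (alpha_l + alpha_u)

Note that grad_fn evaluates a mini-batch gradient, so successive evaluations of \(F'_n(\alpha )\) are stochastic under dynamic MBSS; the bisection therefore approximates an SNN-GPP rather than a deterministic root.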

C Inexact line search: gradient-only line search that is inexact (GOLS-I)

Parameters used for this method are: \(\eta = 2\), \(c_2 = 0.9\), \(\alpha _{min} = 10^{-8}\) and \(\alpha _{max} = 10^7\). \(F'_n(\alpha ) = {{\textit{\textbf{g}}}}({\textit{\textbf{x}}}_n + \alpha \cdot {\textit{\textbf{d}}}_n)^T{\textit{\textbf{d}}}_n \).

[Algorithm listing for GOLS-I, rendered as a figure in the published article.]
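The sketch below, reusing directional_derivative from the GOLS-B sketch, conveys only the general flavour of an inexact gradient-only search with the parameters above; the growth/shrink logic and the acceptance test are illustrative assumptions rather than the published GOLS-I algorithm.

    def gols_inexact(grad_fn, x, d, alpha0=1.0, eta=2.0, c2=0.9,
                     alpha_min=1e-8, alpha_max=1e7):
        """Grow the step while F'_n stays negative, shrink it while positive,
        and accept once the sign change has been crossed or a bound is hit."""
        f0 = directional_derivative(grad_fn, x, d, 0.0)
        alpha = min(max(alpha0, alpha_min), alpha_max)
        f = directional_derivative(grad_fn, x, d, alpha)
        # accept immediately if the derivative is already non-negative but
        # "flat" relative to the initial slope (tolerance governed by c2)
        if 0.0 <= f <= c2 * abs(f0):
            return alpha
        growing = f < 0.0
        while alpha_min < alpha < alpha_max:
            alpha = alpha * eta if growing else alpha / eta
            f = directional_derivative(grad_fn, x, d, alpha)
            if growing and f >= 0.0:        # crossed the sign change from the left
                return alpha
            if (not growing) and f < 0.0:   # crossed it from the right
                return alpha
        return min(max(alpha, alpha_min), alpha_max)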

D Inexact line search: gradient-only line search maximizing step size (GOLS-Max)

Parameters used for this method are: \(\eta = 2\), \(c_2 = 0.9\), \(\alpha _{min} = 10^{-8}\) and \(\alpha _{max} = 10^7\). \(F'_n(\alpha ) = {{\textit{\textbf{g}}}}({\textit{\textbf{x}}}_n + \alpha \cdot {\textit{\textbf{d}}}_n)^T{\textit{\textbf{d}}}_n \).

[Algorithm listing for GOLS-Max, rendered as a figure in the published article.]
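Only the parameters are reproduced above; a generic sketch of a step-size-maximizing variant, again reusing directional_derivative and again an assumption of this sketch rather than the published listing, keeps enlarging the step by \(\eta \) while the directional derivative remains negative:

    def gols_max(grad_fn, x, d, alpha0=1.0, eta=2.0,
                 alpha_min=1e-8, alpha_max=1e7):
        """Return the largest step (up to alpha_max) at which F'_n is still negative."""
        alpha = min(max(alpha0, alpha_min), alpha_max)
        while (alpha * eta <= alpha_max
               and directional_derivative(grad_fn, x, d, alpha * eta) < 0.0):
            alpha *= eta
        return alpha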

E Inexact line search: gradient-only line search with backtracking (GOLS-Back)

Parameters used for this method are: \(\eta = 2\), \(c_2 = 0\), \(\alpha _{min} = 10^{-8}\) and \(\alpha _{max} = 10^7\). \(F'_n(\alpha ) = {{\textit{\textbf{g}}}}({\textit{\textbf{x}}}_n + \alpha \cdot {\textit{\textbf{d}}}_n)^T{\textit{\textbf{d}}}_n \).

[Algorithm listing for GOLS-Back, rendered as a figure in the published article.]
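A backtracking flavour (with \(c_2 = 0\), i.e. no flat-gradient tolerance) can be sketched as repeatedly shrinking an initial step by \(\eta \) until the directional derivative is no longer positive. This too is an assumption made for illustration, reusing directional_derivative from the GOLS-B sketch, and not the published listing.

    def gols_backtrack(grad_fn, x, d, alpha0=1.0, eta=2.0,
                       alpha_min=1e-8, alpha_max=1e7):
        """Shrink the step while F'_n is positive, stepping back towards
        the sign change from the far side."""
        alpha = min(max(alpha0, alpha_min), alpha_max)
        while (alpha / eta >= alpha_min
               and directional_derivative(grad_fn, x, d, alpha) > 0.0):
            alpha /= eta
        return alpha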


About this article


Cite this article

Kafka, D., Wilke, D.N. Resolving learning rates adaptively by locating stochastic non-negative associated gradient projection points using line searches. J Glob Optim 79, 111–152 (2021). https://doi.org/10.1007/s10898-020-00921-z

