Abstract
We consider the class of incremental gradient methods for minimizing a sum of continuously differentiable functions. An important novel feature of our analysis is that the stepsizes are kept bounded away from zero. We derive the first convergence results of any kind for this computationally important case. In particular, we show that a certain ε-approximate solution can be obtained and establish the linear dependence of ε on the stepsize limit. Incremental gradient methods are particularly well-suited for large neural network training problems where obtaining an approximate solution is typically sufficient and is often preferable to computing an exact solution. Thus, in the context of neural networks, the approach presented here is related to the principle of tolerant training. Our results justify numerous stepsize rules that were derived on the basis of extensive numerical experimentation but for which no theoretical analysis was previously available. In addition, convergence to (exact) stationary points is established when the gradient satisfies a certain growth property.
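The abstract does not reproduce the algorithm itself. As an informal illustration only, the sketch below (not taken from the paper) implements the basic cyclic incremental gradient iteration x ← x − η∇f_i(x), visiting the component functions f_i one at a time with a constant stepsize η kept bounded away from zero; the function name incremental_gradient, the least-squares components, and all parameter values are illustrative assumptions, not the paper's construction.

import numpy as np

def incremental_gradient(grads, x0, stepsize=1e-2, epochs=100):
    # Minimize f(x) = sum_i f_i(x) by cycling through the component
    # gradients with a fixed stepsize (bounded away from zero).
    #   grads    : list of callables, grads[i](x) = gradient of f_i at x
    #   x0       : initial point (numpy array)
    #   stepsize : constant stepsize eta > 0
    #   epochs   : number of full passes (cycles) through the components
    x = np.asarray(x0, dtype=float)
    for _ in range(epochs):
        # One cycle: update after each component gradient,
        # not after the gradient of the full sum.
        for g in grads:
            x = x - stepsize * g(x)
    return x

# Illustrative least-squares example: f_i(x) = 0.5 * (a_i^T x - b_i)^2
rng = np.random.default_rng(0)
A, b = rng.normal(size=(50, 5)), rng.normal(size=50)
grads = [lambda x, a=a, bi=bi: a * (a @ x - bi) for a, bi in zip(A, b)]
x_approx = incremental_gradient(grads, np.zeros(5), stepsize=0.01, epochs=200)

With a constant stepsize such an iteration does not, in general, converge to an exact stationary point; consistent with the abstract, one can only expect an approximate solution whose accuracy degrades as the stepsize bound grows.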
Cite this article
Solodov, M.V. Incremental Gradient Algorithms with Stepsizes Bounded Away from Zero. Computational Optimization and Applications 11, 23–35 (1998). https://doi.org/10.1023/A:1018366000512