A non-convergent on-line training algorithm for neural networks
Stopped training is a method for avoiding over-fitting of neural network models by preventing an iterative optimization method from reaching a local minimum of the objective function. It is motivated by the observation that over-fitting occurs gradually as training progresses. The stopping time is typically determined by monitoring the expected generalization performance of the model, as approximated by the error on a validation set. In this paper we propose to use an analytic estimate for this purpose instead. However, such estimates require knowledge of the analytic form of the objective function used for training the network and are applicable only when the weights correspond to a local minimum of this objective function. For this reason, we propose the use of an auxiliary, regularized objective function. The algorithm is “self-contained” and does not require splitting the data into a training set and a separate validation set.
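For orientation, the conventional validation-based form of stopped training described above can be sketched as follows. This is a minimal illustration, not the algorithm proposed in the paper: it assumes a linear model as a stand-in for a network, a squared-error objective, and a simple patience rule. The function name `train_with_validation_stopping` and all of its parameters are hypothetical.

```python
import numpy as np

def train_with_validation_stopping(X_train, y_train, X_val, y_val,
                                   lr=0.01, max_epochs=500, patience=10, seed=0):
    """On-line (per-sample) gradient descent on squared error for a linear model.

    Training stops once the validation error has not improved for
    `patience` consecutive epochs; the best weights seen so far are returned.
    The linear model and the patience rule are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X_train.shape[1])
    best_w, best_val, best_epoch = w.copy(), np.inf, 0

    for epoch in range(max_epochs):
        # One on-line pass: update the weights after every single sample.
        for i in rng.permutation(len(X_train)):
            err = X_train[i] @ w - y_train[i]
            w -= lr * err * X_train[i]

        # The validation error approximates the expected generalization error.
        val = np.mean((X_val @ w - y_val) ** 2)
        if val < best_val:
            best_w, best_val, best_epoch = w.copy(), val, epoch
        elif epoch - best_epoch >= patience:
            break  # stop before the optimizer settles into a minimum

    return best_w, best_val, best_epoch
```

Note that this scheme must hold out part of the data for validation, which is exactly the cost the abstract says the proposed analytic estimate avoids.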