
The weight-decay technique in learning from data: an optimization point of view

  • Original Paper
  • Published in: Computational Management Science

Abstract

The technique known as “weight decay” in the literature on learning from data is investigated using tools from regularization theory. Weight-decay regularization is compared with Tikhonov’s regularization of the learning problem and with a mixed regularized learning technique. The accuracy of suboptimal solutions to weight-decay learning is estimated for connectionistic models with an a priori fixed number of computational units.
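As a rough illustration of this setting (a minimal sketch, not the paper’s own formulation or notation), the following Python snippet learns a model with an a priori fixed number of computational units by minimizing the empirical error plus a weight-decay penalty, i.e., the squared Euclidean norm of the output weights. The toy data, the number of units, and the regularization parameter gamma are all illustrative assumptions.

# Minimal sketch of weight-decay learning (illustrative assumptions throughout).
import numpy as np

rng = np.random.default_rng(0)

# Toy data: noisy samples of a target function.
x = np.linspace(-1.0, 1.0, 50)
y = np.sin(3.0 * x) + 0.1 * rng.standard_normal(x.size)

# Model with an a priori fixed number of computational units:
# f(x) = sum_k w_k * tanh(a_k * x + b_k), with fixed random inner parameters.
n_units = 20
a = rng.standard_normal(n_units)
b = rng.standard_normal(n_units)
Phi = np.tanh(np.outer(x, a) + b)  # design matrix of shape (50, n_units)

# Weight-decay learning: minimize (1/m) * ||Phi w - y||^2 + gamma * ||w||^2.
# For this quadratic problem the minimizer solves a linear system.
gamma = 1e-2
m = x.size
w = np.linalg.solve(Phi.T @ Phi / m + gamma * np.eye(n_units), Phi.T @ y / m)

print("empirical error:", np.mean((Phi @ w - y) ** 2))
print("weight norm:    ", np.linalg.norm(w))

Increasing gamma shrinks the norm of the weights at the cost of a larger empirical error; this trade-off is what the weight-decay regularization parameter controls.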



Author information

Corresponding author

Correspondence to Marcello Sanguineti.

Additional information

The authors were partially supported by a PRIN grant from the Italian Ministry for University and Research, project “Models and Algorithms for Robust Network Optimization”.


About this article

Cite this article

Gnecco, G., Sanguineti, M. The weight-decay technique in learning from data: an optimization point of view. Comput Manag Sci 6, 53–79 (2009). https://doi.org/10.1007/s10287-008-0072-5

