Multi-layer Perceptrons

Chapter in: Computational Intelligence

Abstract

Having described the structure, the operation and the training of (artificial) neural networks in a general fashion in the preceding chapter, we turn in this and the subsequent chapters to specific forms of (artificial) neural networks. We start with the best-known and most widely used form, the so-called multi-layer perceptron (MLP), which is closely related to the networks of threshold logic units. Multi-layer perceptrons exhibit a strictly layered structure and may employ activation functions other than a step at a crisp threshold.

Notes

  1. https://www.pytorch.org and https://www.tensorflow.org, respectively.

  2. Conservative logic is a mathematical model for computations and the computational power of computers, in which the fundamental physical principles that govern computing machines are explicitly taken into account. Among these principles are, for instance, that the speed with which information can travel and the amount of information that can be stored in the state of a finite system are both finite [6].

  3. In the following, we assume implicitly that the output function of all neurons is the identity. Only the activation functions are changed to obtain certain beneficial properties.

  4. Note that there are not even any anomalies at the step borders, because each step border is included in the step to its right, but excluded from the step to its left.

  5. Note that this approach does not transfer easily to functions with multiple arguments. For this to be possible, the influences of the two or more inputs have to be independent in a certain sense.

  6. Note, however, that with this approach the sum of squared errors is minimized in the transformed space (coordinates \(x' = \ln x\) and \(y' = \ln y\)), but this does not imply that it is also minimized in the original space (coordinates x and y). Nevertheless, this approach usually yields very good results, or at least an initial solution that may then be improved by other means (a code sketch after these notes illustrates the transformation).

  7. Note again that with this procedure the sum of squared errors is minimized in the transformed space (coordinates x and \(z = \ln \big (\frac{Y-y}{y}\big )\)), but this does not imply that it is also minimized in the original space (coordinates x and y), cf. the preceding footnote and the code sketch after these notes.

  8. Unless the output function is not differentiable. However, we usually assume (implicitly) that the output function is the identity and thus does not introduce any problems.

  9. In statistics, this is often also referred to as logistic regression, which can cause some confusion.

  10. Nevertheless, the cross entropy is sometimes also defined via a natural logarithm.

  11. As for the sum of squared errors, applying the Newton–Raphson method to find a root of the gradient is an alternative, which we do not consider here.

  12. In order to avoid this factor right from the start, the error of an output neuron is sometimes defined as \(e_u^{(l)} = \frac{1}{2} \big (o_u^{(l)} - \text {out}_u^{(l)}\big )^2\). In this way the factor 2 simply cancels in the derivation (spelled out briefly after these notes).

  13. Note that the bias value \(\theta _u\) is already contained in the extended weight vector.

  14. In contrast to the variance about the mean, for which the squares of the differences to the mean are summed, for the raw variance the squares of the values themselves are summed (written out after these notes).

  15. The added \(\epsilon \) is only a technical trick for an implementation, as it rules out a division by zero.

  16. A side issue, which one should be careful about in an implementation, is that the two \(\epsilon \) (which serve the purpose of preventing a vanishing step size as well as a division by zero) appear inside the square root, while in the preceding methods \(\epsilon \) was added outside the square root. This is not a typographic error, but can be found like this in the original definitions of these methods (a code sketch after these notes contrasts the two placements).

  17. “Xavier” is the given name of the first author of the paper that proposed this method.

  18. “He” is the family name, “Kaiming” the given name of the first author of the proposing paper.
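
The transformations in notes 6 and 7 are easy to try out. The following Python sketch is not taken from the chapter; the function names, the use of NumPy's polyfit, the assumption that \(z = \ln \big (\frac{Y-y}{y}\big )\) depends linearly on x, and the example data are illustrative assumptions. It fits a power law \(y = a x^b\) via the log-log transformation of note 6 and a logistic function with known saturation value Y via the logit transformation of note 7:

```python
# Minimal sketch (not the chapter's code): least-squares fitting via the
# coordinate transformations of notes 6 and 7. All names are illustrative.
import numpy as np

def fit_power_law(x, y):
    """Fit y = a * x**b by linear regression in log-log space (note 6).

    The squared error is minimized for ln y, not for y itself, so the
    result is a good fit or at least a good initial solution.
    """
    b, ln_a = np.polyfit(np.log(x), np.log(y), deg=1)  # ln y = ln a + b ln x
    return np.exp(ln_a), b

def fit_logistic(x, y, Y):
    """Fit y = Y / (1 + exp(a + b*x)) for known saturation value Y (note 7).

    The logit transform z = ln((Y - y) / y) = a + b*x makes the problem
    linear; again the error is minimized only in the transformed space.
    """
    z = np.log((Y - y) / y)
    b, a = np.polyfit(x, z, deg=1)                     # z = a + b x
    return a, b

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.linspace(1.0, 5.0, 20)
    y = 2.0 * x**1.5 * np.exp(rng.normal(0.0, 0.05, x.size))  # noisy power law
    print(fit_power_law(x, y))  # roughly (2.0, 1.5)
```

As both notes emphasize, such a fit is optimal only in the transformed coordinates; it is typically used as is or as a starting point for a subsequent gradient-based refinement in the original coordinates.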
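
To spell out the remark in note 12: with the factor \(\frac{1}{2}\) in the error definition, the chain rule gives

\[ \frac{\partial e_u^{(l)}}{\partial \text {out}_u^{(l)}} = \frac{\partial}{\partial \text {out}_u^{(l)}} \frac{1}{2} \big (o_u^{(l)} - \text {out}_u^{(l)}\big )^2 = \frac{1}{2} \cdot 2 \big (o_u^{(l)} - \text {out}_u^{(l)}\big ) \cdot (-1) = -\big (o_u^{(l)} - \text {out}_u^{(l)}\big ), \]

so no constant factor is carried through the rest of the derivation; without the \(\frac{1}{2}\) the derivative would be \(-2 \big (o_u^{(l)} - \text {out}_u^{(l)}\big )\).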
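
For note 14, written out explicitly (assuming the common normalization by the number n of values, which the note itself does not fix):

\[ v_{\text {raw}} = \frac{1}{n} \sum _{i=1}^{n} x_i^2 \qquad \text {vs.} \qquad v = \frac{1}{n} \sum _{i=1}^{n} \big (x_i - \bar{x}\big )^2 \quad \text {with} \quad \bar{x} = \frac{1}{n} \sum _{i=1}^{n} x_i . \]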
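
The placement of \(\epsilon \) discussed in notes 15 and 16 is easy to get wrong in an implementation. The Python sketch below is not the chapter's code; the parameter values and names are illustrative. It merely contrasts an AdaDelta-like update [38], in which both \(\epsilon \) terms sit inside the square roots, with an Adam-like update [19], in which \(\epsilon \) is added outside the square root:

```python
# Minimal sketch (not the chapter's code) contrasting the placement of eps
# in adaptive step-size methods, cf. notes 15 and 16.
import numpy as np

def adadelta_step(w, g, state, rho=0.95, eps=1e-6):
    """AdaDelta-like update: both eps terms appear INSIDE the square roots.

    state must provide zero-initialized arrays 'Eg2' and 'Edx2' of w's shape.
    """
    state["Eg2"] = rho * state["Eg2"] + (1.0 - rho) * g**2
    delta = -np.sqrt(state["Edx2"] + eps) / np.sqrt(state["Eg2"] + eps) * g
    state["Edx2"] = rho * state["Edx2"] + (1.0 - rho) * delta**2
    return w + delta

def adam_step(w, g, state, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-like update: here eps is added OUTSIDE the square root.

    state must provide t = 0 and zero-initialized arrays 'm' and 'v'.
    """
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * g
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * g**2
    m_hat = state["m"] / (1.0 - beta1 ** state["t"])
    v_hat = state["v"] / (1.0 - beta2 ** state["t"])
    return w - eta * m_hat / (np.sqrt(v_hat) + eps)
```

Mixing up the two conventions rarely breaks training, but it changes the effective step sizes when the accumulated squared gradients are very small, which is exactly the situation the \(\epsilon \) terms are meant to handle.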

References

  1. Anderson ES (1935) The irises of the Gaspé Peninsula. Bull Am Iris Soc 59:2–5. American Iris Society, Philadelphia, PA, USA

  2. Bengio Y, Courville A, Vincent P (2013) Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828. IEEE Press, Piscataway, NJ, USA

  3. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444. Nature Publishing Group, London, United Kingdom

  4. Dozat T (2016) Incorporating Nesterov momentum into Adam. In: Workshop proceedings of the 4th international conference on learning representations (ICLR 2016, San Juan, Puerto Rico). openreview.net

  5. Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res 12:2121–2159. JMLR Publishers

  6. Fredkin E, Toffoli T (1982) Conservative logic. Int J Theor Phys 21(3/4):219–253. Plenum Press, New York, NY, USA

  7. Fahlman SE (1988) An empirical study of learning speed in backpropagation networks. In: [Touretzky et al. 1988]

  8. Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugenics 7(2):179–188. Wiley, Chichester, United Kingdom

  9. Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the 13th international conference on artificial intelligence and statistics (AISTATS 2010, Chia Laguna Resort, Sardinia, Italy), pp 249–256. PMLR Press

  10. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press, Cambridge, MA, USA

  11. He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE international conference on computer vision (ICCV 2015, Las Condes, Chile), pp 1026–1034. IEEE Press, Piscataway, NJ, USA

  12. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR 2016, Las Vegas, NV), pp 770–778. IEEE Press, Piscataway, NJ, USA

  13. Hochreiter S (1991) Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Technische Universität München, Germany

  14. Hochreiter S, Bengio Y, Frasconi P, Schmidhuber J (2001) Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In: [Kremer and Kolen 2001]

  15. Hornik K (1991) Approximation capabilities of multilayer feedforward networks. Neural Netw 4(2):251–257. Elsevier Science, Amsterdam, Netherlands

  16. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd international conference on machine learning (ICML 2015, Lille, France), pp 448–456. PMLR Press

  17. Jacobs RA (1988) Increased rates of convergence through learning rate adaptation. Neural Netw 1:295–307. Pergamon Press, Oxford, United Kingdom

  18. Jang H, Park A, Jung K (2008) Neural network implementation using CUDA and OpenMP. In: Proceedings of digital image computing: techniques and applications (DICTA 2008, Canberra, Australia), pp 155–161. IEEE Press, Piscataway, NJ, USA

  19. Kingma DP, Ba JL (2015) Adam: a method for stochastic optimization. In: Proceedings of the 3rd international conference on learning representations (ICLR 2015, San Diego, CA). openreview.net

  20. Kremer SC, Kolen JF (eds) (2001) A field guide to dynamical recurrent neural networks. IEEE Press, Piscataway, NJ, USA

  21. McCluskey EJ Jr (1956) Minimization of Boolean functions. Bell Syst Tech J 35(6):1417–1444. American Telephone and Telegraph Company, New York, NY, USA

  22. Nesterov YE (1983) A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). Soviet Math Doklady 27(2):372–376. Akademia Nauk USSR, Moscow, USSR

  23. Pinkus A (1999) Approximation theory of the MLP model in neural networks. Acta Numerica 8:143–196. Cambridge University Press, Cambridge, United Kingdom

  24. Polyak BT (1964) Some methods of speeding up the convergence of iteration methods. USSR Comput Math Math Phys 4(5):1–17. Dorodnitsyn Computing Centre, Moscow, USSR

  25. Press WH, Teukolsky SA, Vetterling WT, Flannery BP (1992) Numerical recipes in C: the art of scientific computing, 2nd edn. Cambridge University Press, Cambridge, United Kingdom

  26. Quine WV (1952) The problem of simplifying truth functions. Am Math Mon 59(8):521–531. Mathematical Association of America, Washington, DC, USA

  27. Quine WV (1955) A way to simplify truth functions. Am Math Mon 62(9):627–631. Mathematical Association of America, Washington, DC, USA

  28. Riedmiller M, Braun H (1992) Rprop: a fast adaptive learning algorithm. Technical report, University of Karlsruhe, Karlsruhe, Germany

  29. Riedmiller M, Braun H (1993) A direct adaptive method for faster backpropagation learning: the RPROP algorithm. In: Proceedings of the international conference on neural networks (ICNN-93, San Francisco, CA), pp 586–591. IEEE Press, Piscataway, NJ, USA

  30. Rosenbrock HH (1960) An automatic method for finding the greatest or least value of a function. Comput J 3(3):175–184. Oxford University Press, Oxford, United Kingdom

  31. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323:533–536. Nature Publishing Group, London, United Kingdom

  32. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15:1929–1958. MIT Press, Cambridge, MA, USA

  33. Tieleman T, Hinton G (2012) Lecture 6.5, RMSProp: divide the gradient by a running average of its recent magnitude. Coursera: Neural Netw Mach Learn 4:26–31

  34. Tollenaere T (1990) SuperSAB: fast adaptive backpropagation with good scaling properties. Neural Netw 3:561–573

  35. Touretzky D, Hinton G, Sejnowski T (eds) (1988) Proceedings of the connectionist models summer school (Carnegie Mellon University). Morgan Kaufmann, San Mateo, CA, USA

  36. Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol P-A (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res 11:3371–3408. MIT Press, Cambridge, MA, USA

  37. Werbos PJ (1974) Beyond regression: new tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University, Cambridge, MA, USA

  38. Zeiler MD (2012) AdaDelta: an adaptive learning rate method. Comput Res Repository (CoRR)

Author information

Correspondence to Sanaz Mostaghim.

Copyright information

© 2022 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Kruse, R., Mostaghim, S., Borgelt, C., Braune, C., Steinbrecher, M. (2022). Multi-layer Perceptrons. In: Computational Intelligence. Texts in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-030-42227-1_5

  • DOI: https://doi.org/10.1007/978-3-030-42227-1_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-42226-4

  • Online ISBN: 978-3-030-42227-1
