Abstract
Having described the structure, operation, and training of (artificial) neural networks in a general fashion in the preceding chapter, we turn in this and the subsequent chapters to specific forms of (artificial) neural networks. We start with the best-known and most widely used form, the so-called multi-layer perceptron (MLP), which is closely related to networks of threshold logic units. Multi-layer perceptrons exhibit a strictly layered structure and may employ activation functions other than a step at a crisp threshold.
Notes
- 1.
https://www.pytorch.org and https://www.tensorflow.org, respectively.
- 2.
Conservative logic is a mathematical model for computations and computational powers of computers, in which the fundamental physical principles that govern computing machines are explicitly taken into account. Among these principles are, for instance, that the speed with which information can travel as well as the amount of information that can be stored in the state of a finite system are both finite [6].
- 3.
In the following, we assume implicitly that the output function of all neurons is the identity. Only the activation functions are changed to obtain certain beneficial properties.
- 4.
Note that there are not even any anomalies at the step borders, because the step border is included in the step to the right of it, but excluded from the step to the left of it.
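The half-open convention for the step borders can be made concrete in a small sketch (the function name and the sample values are purely illustrative): each border belongs to the step to its right and is excluded from the step to its left, so the staircase function is defined exactly once everywhere.

```python
def staircase(x, borders, values):
    """values[k] holds on the half-open interval [borders[k], borders[k+1])."""
    result = 0.0  # value below the first border
    for b, v in zip(borders, values):
        if x >= b:  # ">=": a border is included in the step to its right
            result = v
    return result

# borders at 0, 1, 2 with step heights 0, 1, 4
print(staircase(1.0, [0, 1, 2], [0, 1, 4]))  # the border 1 belongs to the right step
```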
- 5.
Note that this approach is not easily transferred to functions with multiple arguments. For this to be possible, the influences of the two or more inputs have to be independent in a certain sense.
- 6.
Note, however, that with this approach the sum of squared errors is minimized in the transformed space (coordinates \(x' = \ln x\) and \(y' = \ln y\)), but this does not imply that it is also minimized in the original space (coordinates x and y). Nevertheless, this approach usually yields very good results or at least an initial solution that may then be improved by other means.
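The transformation described above can be sketched as follows for a power law \(y = a x^b\): taking logarithms gives \(\ln y = \ln a + b \ln x\), so a straight line can be fitted by least squares in the coordinates \(x' = \ln x\) and \(y' = \ln y\). The data below are noise-free and purely illustrative; with noisy data the fit minimizes the squared errors in the transformed space only, as the note explains.

```python
import numpy as np

# Fit y = a * x^b by linearizing: ln y = ln a + b * ln x.
x = np.linspace(1.0, 10.0, 50)
y = 2.0 * x ** 1.5                             # noise-free data for illustration

b, ln_a = np.polyfit(np.log(x), np.log(y), 1)  # line in log-log coordinates
a = np.exp(ln_a)
print(a, b)                                    # recovers a = 2, b = 1.5
```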
- 7.
Note again that with this procedure the sum of squared errors is minimized in the transformed space (coordinates x and \(z = \ln \big (\frac{Y-y}{y}\big )\)), but this does not imply that it is also minimized in the original space (coordinates x and y), cf. the preceding footnote.
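The logit transform mentioned in this note can likewise be sketched: for a logistic function \(y = Y / \big(1 + e^{a + bx}\big)\) with known saturation value \(Y\), the quantity \(z = \ln\big(\frac{Y-y}{y}\big) = a + bx\) is linear in \(x\), so an ordinary line fit in the \((x, z)\) coordinates recovers the parameters (values below are illustrative and noise-free; again the error is minimized in the transformed space only).

```python
import numpy as np

Y = 1.0                                  # known saturation value
x = np.linspace(-3.0, 3.0, 41)
y = Y / (1.0 + np.exp(0.5 - 2.0 * x))    # logistic data with a = 0.5, b = -2

z = np.log((Y - y) / y)                  # logit transform: z = a + b * x
b, a = np.polyfit(x, z, 1)               # straight-line fit in (x, z)
print(a, b)                              # recovers a = 0.5, b = -2
```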
- 8.
This holds unless the output function is not differentiable. However, we usually assume (implicitly) that the output function is the identity and thus does not introduce any problems.
- 9.
In statistics this is often also referred to as logistic regression, which can cause some confusion.
- 10.
Nevertheless the cross entropy is sometimes also defined via a natural logarithm.
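The two definitions differ only by the constant factor \(\ln 2\), so minimizing one minimizes the other. A quick numerical check (the distributions below are arbitrary illustrations):

```python
import math

p = [0.5, 0.25, 0.25]   # target distribution
q = [0.4, 0.4, 0.2]     # predicted distribution

ce_bits = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))  # base-2 logarithm
ce_nats = -sum(pi * math.log(qi) for pi, qi in zip(p, q))   # natural logarithm
print(ce_nats / ce_bits)  # = ln 2, approximately 0.6931
```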
- 11.
As for the sum of squared errors, applying the Newton–Raphson method to find a root of the gradient is an alternative, which we do not consider here.
- 12.
In order to avoid this factor right from the start, the error of an output neuron is sometimes defined as \(e_u^{(l)} = \frac{1}{2} \big (o_u^{(l)} - \text {out}_u^{(l)}\big )^2\). In this way the factor 2 simply cancels in the derivation.
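With this convention the factor indeed cancels when the error is differentiated (a one-line check, in the notation of the note):

```latex
e_u^{(l)} = \frac{1}{2}\bigl(o_u^{(l)} - \text{out}_u^{(l)}\bigr)^2
\quad\Rightarrow\quad
\frac{\partial e_u^{(l)}}{\partial\,\text{out}_u^{(l)}}
  = -\bigl(o_u^{(l)} - \text{out}_u^{(l)}\bigr)
```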
- 13.
Note that the bias value \(\theta _u\) is already contained in the extended weight vector.
- 14.
In contrast to the variance about the mean, for which the squares of the differences to the mean are summed, the squares of the values themselves are summed for the raw variance.
- 15.
The added \(\epsilon \) is only a technical trick for an implementation as it rules out a division by zero.
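The raw variance of the two preceding notes and the technical \(\epsilon\) can be illustrated together in a short sketch (the values and the size of \(\epsilon\) are arbitrary illustrations): the raw variance sums the squares of the values themselves, and the added \(\epsilon\) merely guards against a division by zero if all values vanish.

```python
import math

vals = [3.0, 4.0]
mean = sum(vals) / len(vals)                                     # 3.5
var_about_mean = sum((v - mean) ** 2 for v in vals) / len(vals)  # 0.25
raw_var = sum(v * v for v in vals) / len(vals)                   # 12.5

eps = 1e-8  # rules out a division by zero if all values are zero
normalized = [v / math.sqrt(raw_var + eps) for v in vals]
print(var_about_mean, raw_var, normalized)
```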
- 16.
A side issue, which one should be careful about in an implementation, is that the two \(\epsilon \) (which serve the purpose of preventing a vanishing step size as well as a division by zero) appear inside the square root, while in the preceding methods \(\epsilon \) was added outside the square root. This is not a typographic error, but can be found like this in the original definitions of these methods.
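The two placements of \(\epsilon\) can be contrasted for a single parameter and a single update step (all values below are illustrative): the preceding methods add \(\epsilon\) outside the square root, whereas the definition discussed in this note puts it inside.

```python
import math

grad, v = 0.1, 0.04   # current gradient and running mean of squared gradients
eta, eps = 0.01, 1e-8

step_outside = eta * grad / (math.sqrt(v) + eps)  # eps outside the square root
step_inside  = eta * grad / math.sqrt(v + eps)    # eps inside the square root
print(step_outside, step_inside)
```

For non-vanishing \(v\) the two steps are nearly identical; the placement matters only when \(v\) approaches zero.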
- 17.
“Xavier” is the given name of the first author of the paper that proposed this method.
- 18.
“He” is the family name, “Kaiming” the given name of the first author of the proposing paper.
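The two initialization schemes named in these notes can be sketched as follows (a minimal illustration, using the common uniform Xavier/Glorot and normal He variants; fan_in and fan_out denote the numbers of incoming and outgoing connections of a layer, and the layer sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(fan_in, fan_out):
    # uniform in [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out))
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def he_normal(fan_in, fan_out):
    # zero-mean normal with standard deviation sqrt(2 / fan_in)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

W1 = xavier_uniform(256, 128)  # suited to sigmoid/tanh activations
W2 = he_normal(256, 128)       # suited to ReLU activations
print(W1.shape, W2.shape)
```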
References
Anderson ES (1935) The Irises of the Gaspé Peninsula. Bull Am Iris Soc 59:2–5. American Iris Society, Philadelphia, PA, USA
Bengio Y, Courville A, Vincent P (2013) Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828. IEEE Press, Piscataway, NJ, USA
Bengio Y, LeCun Y, Hinton G (2015) Deep learning. Nature 521:436–444. Nature Publishing Group, London, United Kingdom
Dozat T (2016) Incorporating Nesterov momentum into Adam. In: Workshop proceedings of 4th international conference on learning representations (ICLR 2016, San Juan, Puerto Rico). openreview.net
Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res 12:2121–2159. JMLR Publishers
Fredkin E, Toffoli T (1982) Conservative logic. Int J Theor Phys 21(3/4):219–253. Plenum Press, New York, NY, USA
Fahlman SE (1988) An empirical study of learning speed in backpropagation networks. In: [Touretzky et al. 1988]
Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugenics 7(2):179–188. Wiley, Chichester, United Kingdom
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings 13th international conference on artificial intelligence and statistics (AISTATS 2010, Chia Laguna Resort, Sardinia, Italy), pp 249–256. PMLR Press
Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press, Cambridge, MA, USA
He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: Proceedings IEEE international conference on computer vision (ICCV 2015, Las Condes, Chile), pp 1026–1034. IEEE Press, Piscataway, NJ, USA
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings IEEE conference on computer vision and pattern recognition (CVPR 2016, Las Vegas, NV), pp 770–778. IEEE Press, Piscataway, NJ, USA
Hochreiter S (1991) Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Technische Universität München, Germany
Hochreiter S, Bengio Y, Frasconi P, Schmidhuber J (2001) Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In: [Kremer and Kolen 2001]
Hornik K (1991) Approximation capabilities of multilayer feedforward networks. Neural Netw 4(2):251–257. Elsevier Science, Amsterdam, Netherlands
Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of 32nd international conference machine learning (ICML 2015, Lille, France), pp 448–456. PMLR Press
Jacobs RA (1988) Increased rates of convergence through learning rate adaptation. Neural Netw 1:295–307. Pergamon Press, Oxford, United Kingdom
Jang H, Park A, Jung K (2008) Neural network implementation using CUDA and OpenMP. In: Proceedings of digital image computing: techniques and applications (DICTA 2008, Canberra, Australia), pp 155–161. IEEE Press, Piscataway, NJ, USA
Kingma DP, Ba JL (2015) Adam: a method for stochastic optimization. In: Proceedings of 3rd international conference on learning representations (ICLR 2015, San Diego, CA). openreview.net
Kremer SC, Kolen JF (eds) (2001) A field guide to dynamical recurrent neural networks. IEEE Press, Piscataway, NJ, USA
McCluskey EJ Jr (1956) Minimization of Boolean functions. Bell Syst Tech J 35(6):1417–1444. American Telephone and Telegraph Company, New York, NY, USA
Nesterov YE (1983) A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). Soviet Math Doklady 27(2):372–376. Akademia Nauk USSR, Moscow, USSR
Pinkus A (1999) Approximation theory of the MLP model in neural networks. Acta Numerica 8:143–196. Cambridge University Press, Cambridge, United Kingdom
Polyak BT (1964) Some methods of speeding up the convergence of iteration methods. USSR Comput Math Math Phys 4(5):1–17. Dorodnitsyn Computing Centre, Moscow, USSR
Press WH, Teukolsky SA, Vetterling WT, Flannery BP (1992) Numerical recipes in C—the art of scientific computing, 2nd edn. Cambridge University Press, Cambridge, United Kingdom
Quine WV (1952) The problem of simplifying truth functions. Am Math Mon 59(8):521–531. Mathematical Association of America, Washington, DC, USA
Quine WV (1955) A way to simplify truth functions. Am Math Mon 62(9):627–631. Mathematical Association of America, Washington, DC, USA
Riedmiller M, Braun H (1992) Rprop—a fast adaptive learning algorithm. Technical report, University of Karlsruhe, Karlsruhe, Germany
Riedmiller M, Braun H (1993) A direct adaptive method for faster backpropagation learning: the RPROP algorithm. In: International conference on neural networks (ICNN-93, San Francisco, CA). IEEE Press, Piscataway, NJ, USA, pp 586–591
Rosenbrock HH (1960) An automatic method for finding the greatest or least value of a function. Comput J 3(3):175–184. Oxford University Press, Oxford, United Kingdom
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323:533–536. Nature Publishing Group, London, United Kingdom
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15:1929–1958. MIT Press, Cambridge, MA, USA
Tieleman T, Hinton G (2012) Lecture 6.5—RMSprop: divide the gradient by a running average of its recent magnitude. Coursera: Neural Netw Mach Learn 4:26–31
Tollenaere T (1990) SuperSAB: fast adaptive backpropagation with good scaling properties. Neural Netw 3:561–573
Touretzky D, Hinton G, Sejnowski T (eds) (1988) Proceedings of the connectionist models summer school (Carnegie Mellon University). Morgan Kaufmann, San Mateo, CA, USA
Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol P-A (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res 11:3371–3408. MIT Press, Cambridge, MA, USA
Werbos PJ (1974) Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD Thesis, Harvard University, Cambridge, MA, USA
Zeiler MD (2012) AdaDelta: an adaptive learning rate method. Comput Res Repository (CoRR)
© 2022 Springer Nature Switzerland AG
Kruse, R., Mostaghim, S., Borgelt, C., Braune, C., Steinbrecher, M. (2022). Multi-layer Perceptrons. In: Computational Intelligence. Texts in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-030-42227-1_5
Print ISBN: 978-3-030-42226-4
Online ISBN: 978-3-030-42227-1