Multi-layer Perceptrons

Chapter in: Computational Intelligence

Abstract

Having described the structure, the operation and the training of (artificial) neural networks in a general fashion in the preceding chapter, we turn in this and the subsequent chapters to specific forms of (artificial) neural networks. We start with the best-known and most widely used form, the so-called multi-layer perceptron (MLP), which is closely related to the networks of threshold logic units. Multi-layer perceptrons exhibit a strictly layered structure and may employ activation functions other than a step at a crisp threshold.

Notes

  1. https://www.pytorch.org and https://www.tensorflow.org, respectively.

  2. Conservative logic is a mathematical model for computations and the computational power of computers, in which the fundamental physical principles that govern computing machines are explicitly taken into account. Among these principles are, for instance, that the speed with which information can travel and the amount of information that can be stored in the state of a finite system are both finite [6].

  3. In the following, we assume implicitly that the output function of all neurons is the identity. Only the activation functions are changed to obtain certain beneficial properties.

  4. Note that there are not even any anomalies at the step borders, because each step border is included in the step to its right, but excluded from the step to its left.

  5. Note that this approach does not transfer easily to functions with multiple arguments. For this to be possible, the influences of the two or more inputs have to be independent in a certain sense.

  6. Note, however, that with this approach the sum of squared errors is minimized in the transformed space (coordinates \(x' = \ln x\) and \(y' = \ln y\)), but this does not imply that it is also minimized in the original space (coordinates x and y). Nevertheless, this approach usually yields very good results, or at least an initial solution that may then be improved by other means (a code sketch after these notes illustrates the transformation).

  7. Note again that with this procedure the sum of squared errors is minimized in the transformed space (coordinates x and \(z = \ln \big (\frac{Y-y}{y}\big )\)), but this does not imply that it is also minimized in the original space (coordinates x and y), cf. the preceding footnote and the code sketch after these notes.

  8. Unless the output function is not differentiable. However, we usually assume (implicitly) that the output function is the identity and thus does not introduce any problems.

  9. In statistics, this is often also referred to as logistic regression, which can cause some confusion.

  10. Nevertheless, the cross entropy is sometimes also defined via a natural logarithm.

  11. As for the sum of squared errors, applying the Newton–Raphson method to find a root of the gradient is an alternative, which we do not consider here.

  12. In order to avoid this factor right from the start, the error of an output neuron is sometimes defined as \(e_u^{(l)} = \frac{1}{2} \big (o_u^{(l)} - \text {out}_u^{(l)}\big )^2\). In this way the factor 2 simply cancels in the derivation (spelled out briefly after these notes).

  13. Note that the bias value \(\theta _u\) is already contained in the extended weight vector.

  14. In contrast to the variance about the mean, for which the squares of the differences to the mean are summed, for the raw variance the squares of the values themselves are summed (written out after these notes).

  15. The added \(\epsilon \) is only a technical trick for an implementation, as it rules out a division by zero.

  16. A side issue, which one should be careful about in an implementation, is that the two \(\epsilon \) (which serve the purpose of preventing a vanishing step size as well as a division by zero) appear inside the square root, while in the preceding methods \(\epsilon \) was added outside the square root. This is not a typographic error, but can be found like this in the original definitions of these methods (a code sketch after these notes contrasts the two placements).

  17. “Xavier” is the given name of the first author of the paper that proposed this method.

  18. “He” is the family name, “Kaiming” the given name of the first author of the proposing paper.
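
The transformations in notes 6 and 7 are easy to try out. The following Python sketch is not taken from the chapter; the function names, the use of NumPy's polyfit, the assumption that \(z = \ln \big (\frac{Y-y}{y}\big )\) depends linearly on x, and the example data are illustrative assumptions. It fits a power law \(y = a x^b\) via the log-log transformation of note 6 and a logistic function with known saturation value Y via the logit transformation of note 7:

```python
# Minimal sketch (not the chapter's code): least-squares fitting via the
# coordinate transformations of notes 6 and 7. All names are illustrative.
import numpy as np

def fit_power_law(x, y):
    """Fit y = a * x**b by linear regression in log-log space (note 6).

    The squared error is minimized for ln y, not for y itself, so the
    result is a good fit or at least a good initial solution.
    """
    b, ln_a = np.polyfit(np.log(x), np.log(y), deg=1)  # ln y = ln a + b ln x
    return np.exp(ln_a), b

def fit_logistic(x, y, Y):
    """Fit y = Y / (1 + exp(a + b*x)) for known saturation value Y (note 7).

    The logit transform z = ln((Y - y) / y) = a + b*x makes the problem
    linear; again the error is minimized only in the transformed space.
    """
    z = np.log((Y - y) / y)
    b, a = np.polyfit(x, z, deg=1)                     # z = a + b x
    return a, b

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.linspace(1.0, 5.0, 20)
    y = 2.0 * x**1.5 * np.exp(rng.normal(0.0, 0.05, x.size))  # noisy power law
    print(fit_power_law(x, y))  # roughly (2.0, 1.5)
```

As both notes emphasize, such a fit is optimal only in the transformed coordinates; it is typically used as is or as a starting point for a subsequent gradient-based refinement in the original coordinates.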
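
To spell out the remark in note 12: with the factor \(\frac{1}{2}\) in the error definition, the chain rule gives

\[ \frac{\partial e_u^{(l)}}{\partial \text {out}_u^{(l)}} = \frac{\partial}{\partial \text {out}_u^{(l)}} \frac{1}{2} \big (o_u^{(l)} - \text {out}_u^{(l)}\big )^2 = \frac{1}{2} \cdot 2 \big (o_u^{(l)} - \text {out}_u^{(l)}\big ) \cdot (-1) = -\big (o_u^{(l)} - \text {out}_u^{(l)}\big ), \]

so no constant factor is carried through the rest of the derivation; without the \(\frac{1}{2}\) the derivative would be \(-2 \big (o_u^{(l)} - \text {out}_u^{(l)}\big )\).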
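
For note 14, written out explicitly (assuming the common normalization by the number n of values, which the note itself does not fix):

\[ v_{\text {raw}} = \frac{1}{n} \sum _{i=1}^{n} x_i^2 \qquad \text {vs.} \qquad v = \frac{1}{n} \sum _{i=1}^{n} \big (x_i - \bar{x}\big )^2 \quad \text {with} \quad \bar{x} = \frac{1}{n} \sum _{i=1}^{n} x_i . \]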
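
The placement of \(\epsilon \) discussed in notes 15 and 16 is easy to get wrong in an implementation. The Python sketch below is not the chapter's code; the parameter values and names are illustrative. It merely contrasts an AdaDelta-like update [38], in which both \(\epsilon \) terms sit inside the square roots, with an Adam-like update [19], in which \(\epsilon \) is added outside the square root:

```python
# Minimal sketch (not the chapter's code) contrasting the placement of eps
# in adaptive step-size methods, cf. notes 15 and 16.
import numpy as np

def adadelta_step(w, g, state, rho=0.95, eps=1e-6):
    """AdaDelta-like update: both eps terms appear INSIDE the square roots.

    state must provide zero-initialized arrays 'Eg2' and 'Edx2' of w's shape.
    """
    state["Eg2"] = rho * state["Eg2"] + (1.0 - rho) * g**2
    delta = -np.sqrt(state["Edx2"] + eps) / np.sqrt(state["Eg2"] + eps) * g
    state["Edx2"] = rho * state["Edx2"] + (1.0 - rho) * delta**2
    return w + delta

def adam_step(w, g, state, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-like update: here eps is added OUTSIDE the square root.

    state must provide t = 0 and zero-initialized arrays 'm' and 'v'.
    """
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * g
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * g**2
    m_hat = state["m"] / (1.0 - beta1 ** state["t"])
    v_hat = state["v"] / (1.0 - beta2 ** state["t"])
    return w - eta * m_hat / (np.sqrt(v_hat) + eps)
```

Mixing up the two conventions rarely breaks training, but it changes the effective step sizes when the accumulated squared gradients are very small, which is exactly the situation the \(\epsilon \) terms are meant to handle.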

References

  1. Anderson ES (1935) The irises of the Gaspé Peninsula. Bull Am Iris Soc 59:2–5. American Iris Society, Philadelphia, PA, USA

  2. Bengio Y, Courville A, Vincent P (2013) Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828. IEEE Press, Piscataway, NJ, USA

  3. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444. Nature Publishing Group, London, United Kingdom

  4. Dozat T (2016) Incorporating Nesterov momentum into Adam. In: Workshop proceedings of the 4th international conference on learning representations (ICLR 2016, San Juan, Puerto Rico). openreview.net

  5. Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res 12:2121–2159. JMLR Publishers

  6. Fredkin E, Toffoli T (1982) Conservative logic. Int J Theor Phys 21(3/4):219–253. Plenum Press, New York, NY, USA

  7. Fahlman SE (1988) An empirical study of learning speed in backpropagation networks. In: [Touretzky et al. 1988]

  8. Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugenics 7(2):179–188. Wiley, Chichester, United Kingdom

  9. Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the 13th international conference on artificial intelligence and statistics (AISTATS 2010, Chia Laguna Resort, Sardinia, Italy), pp 249–256. PMLR Press

  10. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press, Cambridge, MA, USA

  11. He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE international conference on computer vision (ICCV 2015, Las Condes, Chile), pp 1026–1034. IEEE Press, Piscataway, NJ, USA

  12. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR 2016, Las Vegas, NV), pp 770–778. IEEE Press, Piscataway, NJ, USA

  13. Hochreiter S (1991) Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Technische Universität München, Germany

  14. Hochreiter S, Bengio Y, Frasconi P, Schmidhuber J (2001) Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In: [Kremer and Kolen 2001]

  15. Hornik K (1991) Approximation capabilities of multilayer feedforward networks. Neural Netw 4(2):251–257. Elsevier Science, Amsterdam, Netherlands

  16. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd international conference on machine learning (ICML 2015, Lille, France), pp 448–456. PMLR Press

  17. Jacobs RA (1988) Increased rates of convergence through learning rate adaptation. Neural Netw 1:295–307. Pergamon Press, Oxford, United Kingdom

  18. Jang H, Park A, Jung K (2008) Neural network implementation using CUDA and OpenMP. In: Proceedings of digital image computing: techniques and applications (DICTA 2008, Canberra, Australia), pp 155–161. IEEE Press, Piscataway, NJ, USA

  19. Kingma DP, Ba JL (2015) Adam: a method for stochastic optimization. In: Proceedings of the 3rd international conference on learning representations (ICLR 2015, San Diego, CA). openreview.net

  20. Kremer SC, Kolen JF (eds) (2001) A field guide to dynamical recurrent neural networks. IEEE Press, Piscataway, NJ, USA

  21. McCluskey EJ Jr (1956) Minimization of Boolean functions. Bell Syst Tech J 35(6):1417–1444. American Telephone and Telegraph Company, New York, NY, USA

  22. Nesterov YE (1983) A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). Soviet Math Doklady 27(2):372–376. Akademia Nauk USSR, Moscow, USSR

  23. Pinkus A (1999) Approximation theory of the MLP model in neural networks. Acta Numerica 8:143–196. Cambridge University Press, Cambridge, United Kingdom

  24. Polyak BT (1964) Some methods of speeding up the convergence of iteration methods. USSR Comput Math Math Phys 4(5):1–17. Dorodnitsyn Computing Centre, Moscow, USSR

  25. Press WH, Teukolsky SA, Vetterling WT, Flannery BP (1992) Numerical recipes in C: the art of scientific computing, 2nd edn. Cambridge University Press, Cambridge, United Kingdom

  26. Quine WV (1952) The problem of simplifying truth functions. Am Math Mon 59(8):521–531. Mathematical Association of America, Washington, DC, USA

  27. Quine WV (1955) A way to simplify truth functions. Am Math Mon 62(9):627–631. Mathematical Association of America, Washington, DC, USA

  28. Riedmiller M, Braun H (1992) Rprop: a fast adaptive learning algorithm. Technical report, University of Karlsruhe, Karlsruhe, Germany

  29. Riedmiller M, Braun H (1993) A direct adaptive method for faster backpropagation learning: the RPROP algorithm. In: Proceedings of the international conference on neural networks (ICNN-93, San Francisco, CA), pp 586–591. IEEE Press, Piscataway, NJ, USA

  30. Rosenbrock HH (1960) An automatic method for finding the greatest or least value of a function. Comput J 3(3):175–184. Oxford University Press, Oxford, United Kingdom

  31. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323:533–536. Nature Publishing Group, London, United Kingdom

  32. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15:1929–1958. MIT Press, Cambridge, MA, USA

  33. Tieleman T, Hinton G (2012) Lecture 6.5, RMSProp: divide the gradient by a running average of its recent magnitude. Coursera: Neural Netw Mach Learn 4:26–31

  34. Tollenaere T (1990) SuperSAB: fast adaptive backpropagation with good scaling properties. Neural Netw 3:561–573

  35. Touretzky D, Hinton G, Sejnowski T (eds) (1988) Proceedings of the connectionist models summer school (Carnegie Mellon University). Morgan Kaufmann, San Mateo, CA, USA

  36. Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol P-A (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res 11:3371–3408. MIT Press, Cambridge, MA, USA

  37. Werbos PJ (1974) Beyond regression: new tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University, Cambridge, MA, USA

  38. Zeiler MD (2012) AdaDelta: an adaptive learning rate method. Comput Res Repository (CoRR)

Author information

Correspondence to Sanaz Mostaghim.

Copyright information

© 2022 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Kruse, R., Mostaghim, S., Borgelt, C., Braune, C., Steinbrecher, M. (2022). Multi-layer Perceptrons. In: Computational Intelligence. Texts in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-030-42227-1_5

  • DOI: https://doi.org/10.1007/978-3-030-42227-1_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-42226-4

  • Online ISBN: 978-3-030-42227-1
