Statistical modelling of artificial neural networks using the multi-layer perceptron
Multi-layer perceptrons (MLPs), a common type of artificial neural network (ANN), are widely used in computer science and engineering for object recognition, discrimination and classification, and have more recently found use in process monitoring and control. “Training” such networks is not a straightforward optimisation problem, and we examine the features of these networks that contribute to the optimisation difficulty.
Although the original “perceptron”, developed in the late 1950s (Rosenblatt 1958, Widrow and Hoff 1960), had a binary output from each “node”, this was not compatible with back-propagation and similar training methods for the MLP. Hence the output of each node (and the final network output) was made a differentiable function of the network inputs. We reformulate the MLP model with the original perceptron in mind so that each node in the “hidden layers” can be considered as a latent (that is, unobserved) Bernoulli random variable. This maintains the property of binary output from the nodes, and with an imposed logistic regression of the hidden layer nodes on the inputs, the expected output of our model is identical to the MLP output with a logistic sigmoid activation function (for the case of one hidden layer).
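The equivalence of expected output stated above can be checked numerically. The following is a minimal sketch, not the paper's own code: it assumes a single hidden layer, a linear output unit, and hidden nodes that are conditionally independent given the inputs. All names and numerical values (`hidden_weights`, `output_weights`, `bias`, the input `x`) are hypothetical and chosen only for illustration.

```python
import itertools
import math


def sigmoid(t):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-t))


def mlp_output(x, hidden_weights, output_weights, bias):
    """One-hidden-layer MLP: logistic hidden units, linear output unit."""
    activations = [sigmoid(sum(a * xi for a, xi in zip(row, x)))
                   for row in hidden_weights]
    return bias + sum(w * h for w, h in zip(output_weights, activations))


def latent_expected_output(x, hidden_weights, output_weights, bias):
    """E[y | x] when each hidden node z_k is Bernoulli with
    P(z_k = 1 | x) = sigmoid(a_k . x), by enumerating all 2^K binary states."""
    probs = [sigmoid(sum(a * xi for a, xi in zip(row, x)))
             for row in hidden_weights]
    expected = 0.0
    for z in itertools.product((0, 1), repeat=len(probs)):
        # Probability of this binary configuration (independence assumed).
        p_z = math.prod(p if zk else 1.0 - p for p, zk in zip(probs, z))
        # Output given binary node values: only "on" nodes contribute.
        y_given_z = bias + sum(w * zk for w, zk in zip(output_weights, z))
        expected += p_z * y_given_z
    return expected


# Hypothetical illustrative values; the first input component acts as an intercept.
x = [1.0, 0.5, -2.0]
hidden_weights = [[0.2, -1.0, 0.5], [1.5, 0.3, -0.7], [-0.4, 0.8, 0.1]]
output_weights = [2.0, -1.0, 0.5]
bias = 0.3

mlp_value = mlp_output(x, hidden_weights, output_weights, bias)
latent_value = latent_expected_output(x, hidden_weights, output_weights, bias)
print(abs(mlp_value - latent_value) < 1e-12)
```

The two values agree because the output is linear in the node values, so the expectation passes through the sum and replaces each binary node by its success probability, which is the sigmoid activation.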
We examine the usual MLP objective function, the sum of squares, and show its multi-modal form and the corresponding optimisation difficulty. We also construct the likelihood for the reformulated latent variable model and maximise it by standard finite mixture maximum likelihood (ML) methods using an EM algorithm, which provides stable ML estimates from random starting positions without the need for regularisation or cross-validation. Over-fitting the number of nodes does not affect this stability. This algorithm is closely related to the EM algorithm of Jordan and Jacobs (1994) for the Mixture of Experts model.
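To make the finite mixture view concrete, here is a hedged sketch of the likelihood contribution and E-step for a single observation. The abstract does not give the paper's exact formulation, so this assumes Gaussian noise with standard deviation `sigma` around a linear function of the binary node values, and conditionally independent hidden nodes; the M-step is omitted. The function name `e_step` and all numerical values are hypothetical.

```python
import itertools
import math


def sigmoid(t):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-t))


def normal_pdf(y, mean, sigma):
    """Density of a normal distribution with the given mean and sd."""
    return math.exp(-0.5 * ((y - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def e_step(x, y, hidden_weights, output_weights, bias, sigma):
    """Mixture likelihood and posterior weights for one observation (x, y).

    The 2^K binary configurations of the K hidden nodes are the mixture
    components; the mixing proportions come from the logistic regressions.
    """
    probs = [sigmoid(sum(a * xi for a, xi in zip(row, x)))
             for row in hidden_weights]
    states = list(itertools.product((0, 1), repeat=len(probs)))
    joint = []
    for z in states:
        prior = math.prod(p if zk else 1.0 - p for p, zk in zip(probs, z))
        mean = bias + sum(w * zk for w, zk in zip(output_weights, z))
        joint.append(prior * normal_pdf(y, mean, sigma))
    likelihood = sum(joint)  # this observation's contribution to the likelihood
    responsibilities = [j / likelihood for j in joint]
    # Posterior probability that each hidden node is "on", given x and y.
    posterior_on = [sum(r for r, z in zip(responsibilities, states) if z[k])
                    for k in range(len(probs))]
    return likelihood, responsibilities, posterior_on


# Hypothetical illustrative values.
x = [1.0, 0.5, -2.0]
hidden_weights = [[0.2, -1.0, 0.5], [1.5, 0.3, -0.7], [-0.4, 0.8, 0.1]]
output_weights = [2.0, -1.0, 0.5]
likelihood, responsibilities, posterior_on = e_step(
    x, y=1.2, hidden_weights=hidden_weights,
    output_weights=output_weights, bias=0.3, sigma=0.5)
print(len(responsibilities), round(sum(responsibilities), 12))
```

Under these assumptions, an M-step would use the posterior weights in weighted logistic regressions for the hidden nodes and a weighted least-squares fit for the output weights. Enumerating all configurations costs 2^K per observation, so this form is only practical for a small number of nodes.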
We conclude with some general comments on the relation between the MLP and latent variable models.
- Aitkin M., Anderson D., and Hinde J. 1981. Statistical modelling of data on teaching styles (with discussion). Journal of the Royal Statistical Society A 144: 419–461.
- Anderson J. and Rosenfeld E. (Eds.) 1989. Neurocomputing: Foundations of Research. MIT Press, Cambridge, Massachusetts.
- Bishop C.M. 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
- Dempster A., Laird N., and Rubin D. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39: 1–38.
- Dybowski R. 1998. Classification of incomplete feature vectors by radial basis function networks. Pattern Recognition Letters 11(14): 1257–1264.
- Foxall R. 2001. Statistical modelling of artificial neural networks. Ph.D. thesis, University of Newcastle-upon-Tyne.
- Jacobs R.A., Jordan M.I., Nowlan S.J., and Hinton G.E. 1991. Adaptive mixtures of local experts. Neural Computation 3: 79–87.
- Jordan M.I. and Jacobs R.A. 1994. Hierarchical mixtures of experts and the EM algorithm. Neural Computation 6: 181–214.
- MacKay D.J. 1992a. Bayesian interpolation. Neural Computation 4(3): 415–447.
- MacKay D.J. 1992b. A practical Bayesian framework for backprop networks. Neural Computation 4: 448–472.
- MacKay D.J. 1995. Probable networks and plausible predictions - A review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems 6: 469–505.
- MacKay D.J. 1999. Comparison of approximate methods for handling hyperparameters. Neural Computation 11: 1035–1068.
- MATLAB Reference Guide 1998. The MathWorks, Inc., Natick, Massachusetts.
- Ripley B.D. 1996. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge.
- Rosenblatt F. 1958. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65: 386–408. Reprinted in Anderson and Rosenfeld (1989).
- Sarle W.S. 1995. Stopped training and other remedies for overfitting. In: Proceedings of the 27th Symposium on the Interface, pp. 352–360.
- Schmidt G., Mattern R., and Schueler F. 1981. Biomechanical investigation to determine physical and traumatological differentiation criteria for the maximum load capacity of head and vertebral column with and without protective helmet under the effects of impact. Tech. Rep., Institut für Rechtsmedizin, University of Heidelberg, West Germany.
- Silverman B. 1985. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society B 47(1): 1–52.
- Tresp V., Ahmad S., and Neuneier R. 1994. Training neural networks with deficient data. Advances in Neural Information Processing Systems 6: 128–135.
- Widrow B. and Hoff M. 1960. Adaptive switching circuits. IRE WESCON Convention Record 96-104. Reprinted in Anderson and Rosenfeld (1989).