Optimizing the Evidence
Since Bayesian learning for neural networks was introduced by MacKay, it has been applied to real-world problems with varying success. Although Bayesian learning provides an elegant theory for preventing neural networks from overfitting, it is not as widely used as it should be. In this paper we focus on two problems that arise in practice: (1) the evidence p(D|α) of the hyperparameter α does not increase monotonically during the learning process, and (2) the correlation coefficient between the evidence and the generalization performance is usually positive but significantly different from 1. The latter problem is addressed in practice by forming a committee of networks with reasonably high evidence, thus reducing the influence of outliers. Based on a good choice of the prior over the hyperparameters, which was crucial for the convergence of the algorithm in our experiments, we exploit the positive correlation between the evidence and the generalization performance by intertwining a search procedure with the iterative Bayesian learning algorithm. We show that this restricts the training process to favorable regions of the search space, so that the influence of the non-monotonic evidence curve can be neglected and the resulting networks have both high evidence and good generalization behavior. The behavior of the algorithm is first visualized on a simple but noisy two-dimensional classification task and then applied to a system predicting the daily exchange rate of the US Dollar against the German Mark.
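The committee idea in the abstract can be sketched in a few lines: rank trained networks by their (log) evidence and keep only the top few, so that an outlier with high evidence but poor generalization is diluted by the rest. The following is a minimal toy sketch, not the paper's implementation; the simulated evidence scores, the correlation strength, and the committee size `k` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-ins for trained networks: each candidate is summarised by its
# log evidence ln p(D | alpha) and a test error. By construction the two
# are negatively coupled but noisy, mirroring the observation above that
# the evidence/generalization correlation is positive yet clearly below 1.
n_nets = 50
log_evidence = rng.normal(loc=-100.0, scale=10.0, size=n_nets)
test_error = 30.0 - 0.2 * log_evidence + rng.normal(scale=5.0, size=n_nets)

def select_committee(log_ev, k):
    """Indices of the k networks with the highest evidence; averaging
    their predictions reduces the influence of individual outliers."""
    return np.argsort(log_ev)[-k:]

committee = select_committee(log_evidence, k=5)
committee_evidence = log_evidence[committee].mean()
population_evidence = log_evidence.mean()
```

In the paper's setting the same ranking is applied repeatedly: the search procedure interleaved with Bayesian training discards low-evidence candidates early, which is what restricts the process to favorable regions of the search space.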
- J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer, 1980.
- C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
- B. P. Carlin and T. A. Louis. Bayes and Empirical Bayes Methods for Data Analysis. Chapman & Hall, 1996.
- S. Gutjahr. Improving the Determination of the Hyperparameters in Bayesian Learning. In Proceedings of the ACNN'98, Brisbane, 1998.
- H. Jeffreys. Theory of Probability. Oxford University Press, 1961.
- D. J. C. MacKay. Bayesian Interpolation. In Neural Computation, 1992.
- D. J. C. MacKay. Bayesian Methods for Neural Networks: Theory and Applications. In Course Notes for the Neural Network Summer School. Cambridge, 1995.
- T. Ragg and H. Braun. A Comparative Study of Neural Network Optimization Techniques. In Proceedings of the ICANNGA'97, Norwich, UK. Springer, 1997.
- T. Ragg and S. Gutjahr. Building High Performant Classifiers by Integrating Bayesian Learning, Mutual Information and Committee Techniques. In Lecture Notes in Computer Science 1156, ICANN'97, Lausanne. Springer, 1997.
- A. Stahlberger and M. Riedmiller. Fast Network Pruning and Feature Extraction by Removing Complete Units. In NIPS 9. MIT Press, 1997.
- H. H. Thodberg. Ace of Bayes: Applications of Neural Networks with Pruning. Technical Report 1132E, Danish Meat Research Institute, 1993.