Confidence estimates of classification accuracy on new examples
Following recent results showing the importance of the fat-shattering dimension in explaining the beneficial effect of a large margin on generalization performance, this paper investigates how the margin on a test example can be used to give greater certainty of correct classification in the distribution-independent model. The results show that even if the classifier does not classify all of the training examples correctly, the fact that a new example has a larger margin than that on the misclassified examples can be used to give very good estimates for the generalization performance in terms of the fat-shattering dimension measured at a scale proportional to the excess margin. The estimate relies on a sufficiently large number of the correctly classified training examples having a margin roughly equal to that used to estimate generalization, indicating that the corresponding output values need to be ‘well sampled’. If this is not the case, it may be better to use the estimate obtained from a smaller margin.
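The quantities named in the abstract can be illustrated with a minimal sketch. It does not reproduce the paper's bound. It only computes, for a hypothetical linear classifier, the excess of a new example's margin over the largest output magnitude among misclassified training examples, and counts how many correctly classified training examples have a margin close to that of the new example (the 'well sampled' condition). The function names, the linear form of the classifier, and the tolerance `tol` are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (input vector, label in {-1, +1})


def output(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Real-valued output f(x) = <w, x> + b of a linear classifier (assumed form)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def excess_margin(
    w: Sequence[float],
    b: float,
    train: List[Example],
    x_new: Sequence[float],
    tol: float = 0.25,  # hypothetical relative tolerance for "roughly equal" margins
) -> Tuple[float, int]:
    """Return (excess margin of x_new, number of nearby correct training margins)."""
    outputs = [(output(w, b, x), y) for x, y in train]

    # Largest output magnitude among misclassified training examples (0 if none).
    gamma_err = max((abs(f) for f, y in outputs if y * f <= 0), default=0.0)

    # Margin of the new example: its label is unknown, so use |f(x_new)|.
    gamma_new = abs(output(w, b, x_new))
    excess = gamma_new - gamma_err

    # Correctly classified training examples whose margin is within
    # tol * gamma_new of the new example's margin.
    nearby = sum(
        1
        for f, y in outputs
        if y * f > 0 and abs(y * f - gamma_new) <= tol * gamma_new
    )
    return excess, nearby


if __name__ == "__main__":
    train = [([2.0, 0.0], 1), ([-2.0, 0.0], -1), ([0.5, 0.0], -1), ([1.8, 0.0], 1)]
    print(excess_margin([1.0, 0.0], 0.0, train, [2.0, 0.0]))
```

In this sketch, a small `nearby` count would correspond to the case the abstract warns about, where the estimate from a smaller margin may be preferable.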
- 1. Noga Alon, Shai Ben-David, Nicolò Cesa-Bianchi, David Haussler, “Scale-sensitive Dimensions, Uniform Convergence, and Learnability,” in Proceedings of the Conference on Foundations of Computer Science (FOCS), 1993. Also to appear in Journal of the ACM.
- 2. Martin Anthony and John Shawe-Taylor, “A Result of Vapnik with Applications,” Discrete Applied Mathematics, 47, 207–217, 1993.
- 3. Peter Bartlett, “The Sample Complexity of Pattern Classification with Neural Networks: the Size of the Weights is More Important than the Size of the Network,” Technical Report, Department of Systems Engineering, Australian National University, May 1996.
- 4. Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik, “A Training Algorithm for Optimal Margin Classifiers,” pages 144–152 in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, ACM, 1992.
- 5. D.J.C. MacKay, Bayesian Methods for Adaptive Models, Ph.D. Thesis, Caltech, 1991.
- 6. John Shawe-Taylor, Peter Bartlett, Robert Williamson and Martin Anthony, Structural Risk Minimization over Data-Dependent Hierarchies, NeuroCOLT Technical Report, NC-TR-96-51.
- 7. Vladimir N. Vapnik, Estimation of Dependences Based on Empirical Data, Springer-Verlag, New York, 1982.
- 8. Vladimir N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.