Automatic Selection of a Discrimination Rule Based upon Minimization of the Empirical Risk

Devroye, Luc

doi:10.1007/978-3-642-83069-3_3

Part of the book series: NATO ASI Series (NATO ASI F, volume 30)

Abstract

A discrimination rule is chosen from a possibly infinite collection of discrimination rules based upon the minimization of the observed error in a test sample. For example, the collection could include all k nearest neighbor rules (for all k), all linear discriminators, and all kernel-based rules (for all possible choices of the smoothing parameter). We do not put any restrictions on the collection.
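As a rough illustration of the selection procedure described above (not the chapter's own construction), the following Python sketch picks k for a k-nearest-neighbor rule by minimizing the error observed on a separate test sample. The toy one-dimensional Gaussian data, the function names, and the restriction to odd k are all assumptions made for this example.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D features).
    Ties (possible only for even k) are broken in favor of class 0."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in neighbors)
    return 1 if 2 * votes > k else 0

def empirical_error(train, test, k):
    """Fraction of test points misclassified by the k-NN rule built on train."""
    mistakes = sum(knn_predict(train, x, k) != y for x, y in test)
    return mistakes / len(test)

def select_rule(train, test, candidates):
    """Return (k, error) for the candidate k with the smallest test-sample error."""
    errors = {k: empirical_error(train, test, k) for k in candidates}
    best_k = min(errors, key=errors.get)
    return best_k, errors[best_k]

def sample(n, rng):
    """Hypothetical toy distribution: class 0 ~ N(0, 1), class 1 ~ N(2, 1)."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        data.append((rng.gauss(2.0 * y, 1.0), y))
    return data

rng = random.Random(0)
n = 100  # training and test samples both have n observations
train, test = sample(n, rng), sample(n, rng)
best_k, best_err = select_rule(train, test, range(1, n + 1, 2))
print("selected k:", best_k, "test-sample error:", best_err)
```

The same `select_rule` loop would work for any collection of rules that can be indexed and evaluated on the test sample; only `empirical_error` needs to change.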

We study how close the probability of error of the selected rule is to the (unknown) minimal probability of error over the entire collection. If both training sample and test sample have n observations, the expected value of the difference is shown to be \(O\left(\sqrt{\log(n)/n}\right)\) for many reasonable collections, such as the one mentioned above. General inequalities governing this error are given which are of a combinatorial nature, i.e., they are valid for all possible distributions of the data, and most practical collections of rules.
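For intuition about the stated rate, consider the much simpler case of a finite collection of N rules, each built from the training sample alone, evaluated on an independent test sample of size m. The following is a standard argument (Hoeffding's inequality plus a union bound) given only as a sketch under these assumptions; the symbols \(L_j\), \(\hat{L}_j\), N and m are introduced here for illustration, and the chapter's own inequalities for infinite collections are in Devroye (1986).

```latex
% L_j       : true probability of error of rule j (given the training sample)
% \hat{L}_j : error frequency of rule j on the m test observations
% \hat{\jmath} : index of the rule minimizing \hat{L}_j
\begin{align*}
L_{\hat{\jmath}} - \min_{1 \le j \le N} L_j
  &\le 2 \max_{1 \le j \le N} \bigl| \hat{L}_j - L_j \bigr|, \\
\mathbf{P}\Bigl\{ \max_{1 \le j \le N} \bigl| \hat{L}_j - L_j \bigr| > \epsilon \Bigr\}
  &\le 2N e^{-2m\epsilon^2}, \\
\mathbf{E}\Bigl\{ L_{\hat{\jmath}} - \min_{1 \le j \le N} L_j \Bigr\}
  &\le \sqrt{\frac{2\log(2N)}{m}} .
\end{align*}
```

With N = n candidate values of k and m = n test observations, the last bound becomes \(\sqrt{2\log(2n)/n}\), which has the order \(\sqrt{\log(n)/n}\) in this special case.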

The theory is based in part on the work of Vapnik and Chervonenkis regarding minimization of the empirical risk. For all proofs, technical details, and additional examples, we refer to Devroye (1986).

As a by-product, we establish that for some nonparametric rules, the probability of error of the selected rule converges at the optimal rate (achievable within the given collection of non-parametric rules) to the Bayes probability of error, and this without actually knowing the optimal rate of convergence to the Bayes probability of error.

Research of the author was sponsored by NSERC Grant A3456 and by FCAR Grant EQ-1678.

References

O. Bashkirov, E.M. Braverman, and I.E. Muchnik, “Potential function algorithms for pattern recognition learning machines,” Automation and Remote Control, vol. 25, pp. 692–695, 1964.
L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, Classification and Regression Trees, Wadsworth International, Belmont, CA., 1984.
T. Cacoullos, “Estimation of a multivariate density,” Annals of the Institute of Statistical Mathematics, vol. 18, pp. 179–190, 1965.
R.G. Casey and G. Nagy, “Decision tree design using a probabilistic model,” IEEE Transactions on Information Theory, vol. IT-30, pp. 93–99, 1984.
T.M. Cover, “Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition,” IEEE Transactions on Electronic Computers, vol. EC-14, pp. 326–334, 1965.
T.M. Cover, “Learning in pattern recognition,” in Methodologies of Pattern Recognition, ed. S. Watanabe, pp. 111–132, Academic Press, New York, N.Y., 1969.
T.M. Cover and T.J. Wagner, “Topics in statistical pattern recognition,” Communication and Cybernetics, vol. 10, pp. 15–46, 1975.
L. Devroye, “A universal k-nearest neighbor procedure in discrimination,” in Proceedings of the 1978 IEEE Computer Society Conference on Pattern Recognition and Image Processing, pp. 142–147, 1978.
L. Devroye and T.J. Wagner, “Distribution-free performance bounds for potential function rules,” IEEE Transactions on Information Theory, vol. IT-25, pp. 601–604, 1979.
L. Devroye and T.J. Wagner, “Distribution-free performance bounds with the resubstitution error estimate,” IEEE Transactions on Information Theory, vol. IT-25, pp. 208–210, 1979.
L. Devroye and T.J. Wagner, “Distribution-free inequalities for the deleted and holdout error estimates,” IEEE Transactions on Information Theory, vol. IT-25, pp. 202–207, 1979.
L. Devroye and T.J. Wagner, “Distribution-free consistency results in nonparametric discrimination and regression function estimation,” Annals of Statistics, vol. 8, pp. 231–239, 1980.
L. Devroye, “Bounds for the uniform deviation of empirical measures,” Journal of Multivariate Analysis, vol. 12, pp. 72–79, 1982.
L. Devroye and L. Gyorfi, Nonparametric Density Estimation: the L₁ View, John Wiley, New York, 1985.
L. Devroye, “Automatic pattern recognition: a study of the probability of error,” Technical Report, School of Computer Science, McGill University, 1986.
R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, John Wiley, New York, N.Y., 1973.
B. Efron, “Bootstrap methods: another look at the jackknife,” Annals of Statistics, vol. 7, pp. 1–26, 1979.
B. Efron, “Estimating the error rate of a prediction rule: improvement on cross validation,” Journal of the American Statistical Association, vol. 78, pp. 316–331, 1983.
L. Feinholz, “Estimation of the performance of partitioning algorithms in pattern classification,” M.Sc. Thesis, Department of Mathematics, McGill University, Montreal, 1979.
E. Fix and J.L. Hodges, “Discriminatory analysis, nonparametric discrimination, consistency properties,” Report 21-49-004, USAF School of Aviation Medicine, Randolph Field, Texas, 1951.
E. Fix and J.L. Hodges, “Discriminatory analysis: small sample performance,” Report 21-49-004, USAF School of Aviation Medicine, Randolph Field, Texas, 1952.
N. Glick, “Sample-based classification procedures derived from density estimators,” Journal of the American Statistical Association, vol. 67, pp. 116–122, 1972.
N. Glick, “Sample-based classification procedures related to empiric distributions,” IEEE Transactions on Information Theory, vol. IT-22, pp. 454–461, 1976.
N. Glick, “Additive estimators for probabilities of correct classification,” Pattern Recognition, vol. 10, pp. 211–222, 1978.
W. Greblicki, A. Krzyzak, and M. Pawlak, “Distribution-free pointwise consistency of kernel regression estimate,” Annals of Statistics, vol. 12, pp. 1570–1575, 1984.
L.N. Kanal, “Patterns in pattern recognition,” IEEE Transactions on Information Theory, vol. IT-20, pp. 697–722, 1974.
Y.K. Lin and K.S. Fu, “Automatic classification of cervical cells using a binary tree classifier,” Pattern Recognition, vol. 16, pp. 69–80, 1983.
A.L. Lunts and V.L. Brailovsky, “Evaluation of attributes obtained in statistical decision rules,” Engineering Cybernetics, vol. 5, pp. 98–109, 1967.
P. Massart, “Vitesse de convergence dans le théorème de la limite centrale pour le processus empirique,” Ph.D. Dissertation, Université de Paris-Sud, Orsay, France, 1983.
W. Meisel, “Potential functions in mathematical pattern recognition,” IEEE Transactions on Computers, vol. C-18, pp. 911–918, 1969.
R.A. Olshen, “Comments on a paper by C.J. Stone,” Annals of Statistics, vol. 5, pp. 632–633, 1977.
E. Parzen, “On the estimation of a probability density function and the mode,” Annals of Mathematical Statistics, vol. 33, pp. 1065–1076, 1962.
H.J. Payne and W.S. Meisel, “An algorithm for constructing optimal binary decision trees,” IEEE Transactions on Computers, vol. C-26, pp. 905–916, 1977.
M. Rosenblatt, “Remark on some nonparametric estimates of a density function,” Annals of Mathematical Statistics, vol. 27, pp. 832–837, 1956.
G. Sebestyen, Decision Making Processes in Pattern Recognition, Macmillan, New York, N.Y., 1962.
I.K. Sethi and B. Chatterjee, “Efficient decision tree design for discrete variable pattern recognition problems,” Pattern Recognition, vol. 9, pp. 197–206, 1977.
I.K. Sethi and G.P.R. Sarvarayudu, “Hierarchical classifier design using mutual information,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-4, pp. 441–445, 1981.
C. Spiegelman and J. Sacks, “Consistent window estimation in nonparametric regression,” Annals of Statistics, vol. 8, pp. 240–246, 1980.
C.J. Stone, “Consistent nonparametric regression,” Annals of Statistics, vol. 5, pp. 595–645, 1977.
M. Stone, “Cross-validatory choice and assessment of statistical predictions,” Journal of the Royal Statistical Society, vol. 36, pp. 111–147, 1974.
G.T. Toussaint, “Bibliography on estimation of misclassification,” IEEE Transactions on Information Theory, vol. IT-20, pp. 474–479, 1974.
J. Van Ryzin, “Bayes risk consistency of classification procedures using density estimation,” Sankhya Series A, vol. 28, pp. 161–170, 1966.
V.N. Vapnik, Estimation of Dependences Based on Empirical Data, Springer-Verlag, 1982.
V.N. Vapnik and A. Ya. Chervonenkis, “Theory of uniform convergence of frequencies of events to their probabilities and problems of search for an optimal solution from empirical data,” Automation and Remote Control, vol. 32, pp. 207–217, 1971.
V.N. Vapnik and A. Ya. Chervonenkis, “On the uniform convergence of relative frequencies of events to their probabilities,” Theory of Probability and its Applications, vol. 16, pp. 264–280, 1971.
V.N. Vapnik and A. Ya. Chervonenkis, “Ordered risk minimization. I,” Automation and Remote Control, vol. 35, pp. 1226–1235, 1974.
V.N. Vapnik and A. Ya. Chervonenkis, “Ordered risk minimization. II,” Automation and Remote Control, vol. 35, pp. 1403–1412, 1974.
V.N. Vapnik and A. Ya. Chervonenkis, Theory of Pattern Recognition, Nauka, Moscow, 1974.
V.N. Vapnik and A. Ya. Chervonenkis, “Necessary and sufficient conditions for the uniform convergence of means to their expectations,” Theory of Probability and its Applications, vol. 26, pp. 532–553, 1981.

Author information

Authors and Affiliations

School of Computer Science, McGill University, 805 Sherbrooke Street West, H3A 2K6, Montréal, Canada
Luc Devroye

Editor information

Editors and Affiliations

Philips Research Laboratory, Avenue Em. Van Becelaere 2, Box 8, B-1170, Brussels, Belgium
Pierre A. Devijver
Department of Electronic and Electrical Engineering, University of Surrey, GU2 5XH, Guildford, UK
Josef Kittler

Cite this paper

Devroye, L. (1987). Automatic Selection of a Discrimination Rule Based upon Minimization of the Empirical Risk. In: Devijver, P.A., Kittler, J. (eds) Pattern Recognition Theory and Applications. NATO ASI Series, vol 30. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-83069-3_3

DOI: https://doi.org/10.1007/978-3-642-83069-3_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-83071-6
Online ISBN: 978-3-642-83069-3
eBook Packages: Springer Book Archive