Abstract
Pattern recognition (also called classification or discrimination) is about guessing or predicting the unknown class of an observation. An observation is a collection of numerical measurements, represented by a d-dimensional vector x. The unknown nature of the observation is called its class; it is denoted by y and takes values in the set {0, 1}. (For simplicity, we restrict our attention to binary classification.) In pattern recognition, one constructs a function g: R^d → {0, 1} that represents one's guess of y given x. The mapping g is called a classifier. A classifier errs on x if g(x) ≠ y.
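To make the setting concrete, the quality of a classifier is measured by its probability of error, and on a finite sample by the fraction of misclassified points. The display below is a minimal sketch of these two standard quantities in the notation customary in this literature (the symbols are illustrative, not quoted from the chapter): the risk L(g) of a classifier g for a random pair (X, Y), and its empirical counterpart on n labelled observations.

```latex
% Risk (probability of error) of a classifier g, for a random pair (X, Y) in R^d x {0, 1}:
L(g) = \mathbb{P}\{\, g(X) \neq Y \,\}

% Empirical risk of g on a sample (X_1, Y_1), ..., (X_n, Y_n):
% the fraction of sample points on which g errs.
\widehat{L}_n(g) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{\, g(X_i) \neq Y_i \,\}
```

Much of the material surveyed in the references below (concentration inequalities, VC theory, and complexity regularization) concerns how closely the empirical risk tracks the true risk, uniformly over a class of candidate classifiers.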
Keywords
- Independent Random Variable
- Pattern Classification
- Empirical Process
- Empirical Risk
- Concentration Inequality
References
General
M. Anthony and P. L. Bartlett, Neural Network Learning: Theoretical Foundations, Cambridge University Press, Cambridge, 1999.
N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, UK, 2000.
L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, 1996.
V.N. Vapnik. Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York, 1982.
V.N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
V.N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
V.N. Vapnik and A.Ya. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow, 1974. (in Russian); German translation: Theorie der Zeichenerkennung, Akademie Verlag, Berlin, 1979.
Concentration for sums of independent random variables
G. Bennett. Probability inequalities for the sum of independent random variables. Journal of the American Statistical Association, 57: 33–45, 1962.
S.N. Bernstein. The Theory of Probabilities. Gastehizdat Publishing House, Moscow, 1946.
H. Chernoff. A measure of asymptotic efficiency of tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23: 493–507, 1952.
T. Hagerup and C. Rüb. A guided tour of Chernoff bounds. Information Processing Letters, 33: 305–308, 1990.
C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics 1989, pages 148–188. Cambridge University Press, Cambridge, 1989.
W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58: 13–30, 1963.
R.M. Karp. Probabilistic Analysis of Algorithms. Class Notes, University of California, Berkeley, 1988.
M. Okamoto. Some inequalities relating to the partial sum of binomial probabilities. Annals of the Institute of Statistical Mathematics, 10: 29–35, 1958.
Concentration
K. Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 19: 357–367, 1967.
S. Boucheron, G. Lugosi, and P. Massart. A sharp concentration inequality with applications in random combinatorics and learning. Random Structures and Algorithms, 16: 277–292, 2000.
L. Devroye. Exponential inequalities in nonparametric estimation. In G. Roussas, editor, Nonparametric Functional Estimation and Related Topics, pages 31–44. NATO ASI Series, Kluwer Academic Publishers, Dordrecht, 1991.
J. H. Kim. The Ramsey number R(3, t) has order of magnitude t^2/log t. Random Structures and Algorithms, 7: 173–207, 1995.
M. Ledoux. On Talagrand’s deviation inequalities for product measures. ESAIM: Probability and Statistics, 1: 63–87, 1996.
K. Marton. A simple proof of the blowing-up lemma. IEEE Transactions on Information Theory, 32: 445–446, 1986.
K. Marton. Bounding d̄-distance by informational divergence: a method to prove measure concentration. Annals of Probability, 24: 857–866, 1996.
K. Marton. A measure concentration inequality for contracting Markov chains. Geometric and Functional Analysis, 6: 556–571, 1996. Erratum: 7: 609–613, 1997.
P. Massart. About the constant in Talagrand’s concentration inequalities from empirical processes. Annals of Probability, 28: 863–884, 2000.
W. Rhee and M. Talagrand. Martingales, inequalities, and NP-complete problems. Mathematics of Operations Research, 12: 177–181, 1987.
J.M. Steele. An Efron-Stein inequality for nonsymmetric statistics. Annals of Statistics, 14: 753–758, 1986.
M. Talagrand. Concentration of measure and isoperimetric inequalities in product spaces. I.H.E.S. Publications Mathématiques, 81: 73–205, 1996.
M. Talagrand. New concentration inequalities in product spaces. Invent. Math. 126: 505–563, 1996.
M. Talagrand. A new look at independence. Annals of Probability, 24: 1–34, 1996 (special invited paper).
VC theory
K. Alexander. Probability inequalities for empirical processes and a law of the iterated logarithm. Annals of Probability, 12: 1041–1067, 1984.
M. Anthony and J. Shawe-Taylor. A result of Vapnik with applications. Discrete Applied Mathematics, 47: 207–217, 1993.
P. Bartlett and G. Lugosi. An inequality for uniform deviations of sample averages from their means. Statistics and Probability Letters, 44: 55–62, 1999.
L. Breiman. Bagging predictors. Machine Learning, 24: 123–140, 1996.
L. Devroye. Bounds for the uniform deviation of empirical measures. Journal of Multivariate Analysis, 12: 72–79, 1982.
A. Ehrenfeucht, D. Haussler, M. Kearns, and L. Valiant. A general lower bound on the number of examples needed for learning. Information and Computation, 82: 247–261, 1989.
Y. Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121: 256–285, 1995.
E. Giné and J. Zinn. Some limit theorems for empirical processes. Annals of Probability, 12: 929–989, 1984.
D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100: 78–150, 1992.
V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers, Annals of Statistics, 30, 2002.
M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer-Verlag, New York, 1991.
G. Lugosi. Improved upper bounds for probabilities of uniform deviations. Statistics and Probability Letters, 25: 71–77, 1995.
D. Pollard. Convergence of Stochastic Processes. Springer-Verlag, New York, 1984.
R.E. Schapire, Y. Freund, P. Bartlett, and W.S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods, Annals of Statistics, 26: 1651–1686, 1998.
R.E. Schapire. The strength of weak learnability. Machine Learning, 5: 197–227, 1990.
M. Talagrand. Sharper bounds for Gaussian and empirical processes. Annals of Probability, 22: 28–76, 1994.
S. Van de Geer. Estimating a regression function. Annals of Statistics, 18: 907–924, 1990.
V.N. Vapnik. Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York, 1982.
V.N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
V.N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
V.N. Vapnik and A.Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16: 264–280, 1971.
V.N. Vapnik and A.Ya. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow, 1974. (in Russian); German translation: Theorie der Zeichenerkennung, Akademie Verlag, Berlin, 1979.
A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, New York, 1996.
Shatter coefficients, VC dimension
P. Assouad, Sur les classes de Vapnik-Chervonenkis, C.R. Acad. Sci. Paris, vol. 292, Sér.I, pp. 921–924, 1981.
T. M. Cover, Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition, IEEE Transactions on Electronic Computers, vol. 14, pp. 326–334, 1965.
R. M. Dudley, Central limit theorems for empirical measures, Annals of Probability, vol. 6, pp. 899–929, 1978.
R. M. Dudley, Balls in R^k do not cut all subsets of k + 2 points, Advances in Mathematics, vol. 31 (3), pp. 306–308, 1979.
P. Frankl, On the trace of finite sets, Journal of Combinatorial Theory, Series A, vol. 34, pp. 41–45, 1983.
D. Haussler, Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik-Chervonenkis dimension, Journal of Combinatorial Theory, Series A, vol. 69, pp. 217–232, 1995.
N. Sauer, On the density of families of sets, Journal of Combinatorial Theory Series A, vol. 13, pp. 145–147, 1972.
L. Schläfli, Gesammelte Mathematische Abhandlungen, Birkhäuser-Verlag, Basel, 1950.
S. Shelah, A combinatorial problem: stability and order for models and theories in infinitary languages, Pacific Journal of Mathematics, vol. 41, pp. 247–261, 1972.
J. M. Steele, Combinatorial entropy and uniform limit laws, Ph.D. dissertation, Stanford University, Stanford, CA, 1975.
J. M. Steele, Existence of submatrices with all possible columns, Journal of Combinatorial Theory, Series A, vol. 28, pp. 84–88, 1978.
R. S. Wenocur and R. M. Dudley, Some special Vapnik-Chervonenkis classes, Discrete Mathematics, vol. 33, pp. 313–318, 1981.
Lower bounds
A. Antos and G. Lugosi. Strong minimax lower bounds for learning. Machine Learning, vol. 30, 31–56, 1998.
P. Assouad. Deux remarques sur l’estimation. Comptes Rendus de l’Académie des Sciences de Paris, 296: 1021–1024, 1983.
L. Birgé. Approximation dans les espaces métriques et théorie de l’estimation. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 65: 181–237, 1983.
L. Birgé. On estimating a density using Hellinger distance and some other strange facts. Probability Theory and Related Fields, 71: 271–291, 1986.
A. Blumer, A. Ehrenfeucht, D. Haussler, and M.K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36: 929–965, 1989.
L. Devroye and G. Lugosi. Lower bounds in pattern recognition and learning. Pattern Recognition, 28: 1011–1018, 1995.
A. Ehrenfeucht, D. Haussler, M. Kearns, and L. Valiant. A general lower bound on the number of examples needed for learning. Information and Computation, 82: 247–261, 1989.
D. Haussler, N. Littlestone, and M. Warmuth. Predicting {0, 1}-functions on randomly drawn points. Information and Computation, 115: 248–292, 1994.
E. Mammen, A. B. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27: 1808–1829, 1999.
D. Schuurmans. Characterizing rational versus exponential learning curves. In Computational Learning Theory: Second European Conference. EuroCOLT’95, pages 272–286. Springer Verlag, 1995.
V.N. Vapnik and A.Ya. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow, 1974. (in Russian); German translation: Theorie der Zeichenerkennung, Akademie Verlag, Berlin, 1979.
Complexity regularization
H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19: 716–723, 1974.
A.R. Barron. Logically smooth density estimation. Technical Report TR 56, Department of Statistics, Stanford University, 1985.
A.R. Barron. Complexity regularization with application to artificial neural networks. In G. Roussas, editor, Nonparametric Functional Estimation and Related Topics, pages 561–576. NATO ASI Series, Kluwer Academic Publishers, Dordrecht, 1991.
A.R. Barron, L. Birgé, and P. Massart. Risk bounds for model selection via penalization. Probability Theory and Related Fields, 113: 301–413, 1999.
A.R. Barron and T.M. Cover. Minimum complexity density estimation. IEEE Transactions on Information Theory, 37: 1034–1054, 1991.
P. L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44 (2): 525–536, March 1998.
P. Bartlett, S. Boucheron, and G. Lugosi, Model selection and error estimation. Proceedings of the 13th Annual Conference on Computational Learning Theory, ACM Press, pp. 286–297, 2000.
L. Birgé and P. Massart. From model selection to adaptive estimation. In D. Pollard, E. Torgersen, and G. Yang, editors, Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics, pages 55–87. Springer, New York, 1997.
L. Birgé and P. Massart. Minimum contrast estimators on sieves: exponential bounds and rates of convergence. Bernoulli, 4: 329–375, 1998.
Y. Freund. Self bounding learning algorithms. Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 247–258, 1998.
A.R. Gallant. Nonlinear Statistical Models. John Wiley, New York, 1987.
S. Geman and C.R. Hwang. Nonparametric maximum likelihood estimation by the method of sieves. Annals of Statistics, 10: 401–414, 1982.
M. Kearns, Y. Mansour, A.Y. Ng, and D. Ron. An experimental and theoretical comparison of model selection methods. In Proceedings of the Eighth Annual ACM Workshop on Computational Learning Theory, pages 21–30. Association for Computing Machinery, New York, 1995.
A. Krzyzak and T. Linder. Radial basis function networks and complexity regularization in function learning. IEEE Transactions on Neural Networks, 9: 247–256, 1998.
G. Lugosi and A. Nobel. Adaptive model selection using empirical complexities. Annals of Statistics, vol. 27, no. 6, 1999.
G. Lugosi and K. Zeger. Nonparametric estimation via empirical risk minimization. IEEE Transactions on Information Theory, 41: 677–687, 1995.
G. Lugosi and K. Zeger. Concept learning using complexity regularization. IEEE Transactions on Information Theory, 42: 48–54, 1996.
C.L. Mallows. Some comments on Cp. Technometrics, 15: 661–675, 1973.
P. Massart. Some applications of concentration inequalities to statistics. Annales de la faculté des sciences de l’université de Toulouse, Mathématiques, série 6, IX: 245–303, 2000.
R. Meir. Performance bounds for nonlinear time series prediction. In Proceedings of the Tenth Annual ACM Workshop on Computational Learning Theory, pages 122–129. Association for Computing Machinery, New York, 1997.
D.S. Modha and E. Masry. Minimum complexity regression estimation with weakly dependent observations. IEEE Transactions on Information Theory, 42: 2133–2145, 1996.
J. Rissanen. A universal prior for integers and estimation by minimum description length. Annals of Statistics, 11: 416–431, 1983.
G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6: 461–464, 1978.
J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44 (5): 1926–1940, 1998.
X. Shen and W.H. Wong. Convergence rate of sieve estimates. Annals of Statistics, 22: 580–615, 1994.
Y. Yang and A.R. Barron. Information-theoretic determination of minimax rates of convergence. Annals of Statistics, to appear, 1997.
Y. Yang and A.R. Barron. An asymptotic property of model selection criteria. IEEE Transactions on Information Theory, 1998.
Copyright information
© 2002 Springer-Verlag Wien
Cite this chapter
Lugosi, G. (2002). Pattern Classification and Learning Theory. In: Györfi, L. (eds) Principles of Nonparametric Learning. International Centre for Mechanical Sciences, vol 434. Springer, Vienna. https://doi.org/10.1007/978-3-7091-2568-7_1
Publisher Name: Springer, Vienna
Print ISBN: 978-3-211-83688-0
Online ISBN: 978-3-7091-2568-7