[1] D. Allen, The relationship between variable selection and data augmentation and a method for prediction, Technometrics 16 (1974) 125–127.

[2] N. Alon, S. Ben-David, N. Cesa-Bianchi and D. Haussler, Scale-sensitive dimensions, uniform convergence, and learnability, in: *Symposium on Foundations of Computer Science* (1993).

[3] S. Amari, A. Cichocki and H. Yang, A new learning algorithm for blind signal separation, in: *Advances in Neural Information Processing Systems* (MIT Press, Cambridge, MA, 1995) pp. 757–763.

[4] S.I. Amari, Natural gradient works efficiently in learning, Neural Comput. 10 (1998) 251–276.

[5] N. Aronszajn, Theory of reproducing kernels, Trans. Amer. Math. Soc. 68 (1950) 337–404.

[6] P. Bartlett, The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network, IEEE Trans. Inform. Theory (1998).

[7] P. Bartlett, P.M. Long and R.C. Williamson, Fat-shattering and the learnability of real-valued functions, J. Comput. Systems Sci. 52(3) (1996) 434–452.

[8] P. Bartlett and J. Shawe-Taylor, Generalization performance of support vector machines and other pattern classifiers, in: *Advances in Kernel Methods – Support Vector Learning*, eds. C. Burges and B. Schölkopf (MIT Press, Cambridge, MA, 1998).

[9] A.J. Bell and T. Sejnowski, An information maximization approach to blind separation and blind deconvolution, Neural Comput. 7 (1995) 1129–1159.

[10] M. Bertero, Regularization methods for linear inverse problems, in: *Inverse Problems*, ed. C.G. Talenti (Springer, Berlin, 1986).

[11] M. Bertero, T. Poggio and V. Torre, Ill-posed problems in early vision, Proc. IEEE 76 (1988) 869–889.

[12] L. Bottou and V. Vapnik, Local learning algorithms, Neural Comput. 4(6) (1992) 888–900.

[13] M.D. Buhmann, Multivariate cardinal interpolation with radial basis functions, Constr. Approx. 6 (1990) 225–255.

[14] M.D. Buhmann, On quasi-interpolation with radial basis functions, Numerical Analysis Reports DAMTP 1991/NA3, Department of Applied Mathematics and Theoretical Physics, Cambridge, England (March 1991).

[15] C. Burges, A tutorial on support vector machines for pattern recognition, in: *Data Mining and Knowledge Discovery*, Vol. 2 (Kluwer Academic, Boston, 1998).

[16] O. Chapelle and V. Vapnik, Model selection for support vector machines, in: *Advances in Neural Information Processing Systems* (1999).

[17] S. Chen, D. Donoho and M. Saunders, Atomic decomposition by basis pursuit, Technical Report 479, Department of Statistics, Stanford University (May 1995).

[18] S. Chen, Basis Pursuit, Ph.D. thesis, Department of Statistics, Stanford University (November 1995).

[19] V. Cherkassky and F. Mulier, *Learning from Data: Concepts, Theory, and Methods* (Wiley, New York, 1998).

[20] J.A. Cochran, *The Analysis of Linear Integral Equations* (McGraw-Hill, New York, 1972).

[21] R.R. Coifman and M.V. Wickerhauser, Entropy-based algorithms for best-basis selection, IEEE Trans. Inform. Theory 38 (1992) 713–718.

[22] C. Cortes and V. Vapnik, Support vector networks, Machine Learning 20 (1995) 1–25.

[23] R. Courant and D. Hilbert, *Methods of Mathematical Physics*, Vol. 2 (Interscience, London, England, 1962).

[24] I. Daubechies, *Ten Lectures on Wavelets*, CBMS-NSF Regional Conferences Series in Applied Mathematics (SIAM, Philadelphia, PA, 1992).

[25] C. de Boor, Quasi-interpolants and approximation power of multivariate splines, in: *Computation of Curves and Surfaces*, eds. M. Gasca and C.A. Micchelli (Kluwer Academic, Dordrecht, Netherlands, 1990) pp. 313–345.

[26] R.A. DeVore, Nonlinear approximation, Acta Numerica (1998) 51–150.

[27] L. Devroye, L. Györfi and G. Lugosi, *A Probabilistic Theory of Pattern Recognition*, Applications of Mathematics, Vol. 31 (Springer, New York, 1996).

[28] R.O. Duda and P.E. Hart, *Pattern Classification and Scene Analysis* (Wiley, New York, 1973).

[29] R.M. Dudley, A course on empirical processes, in: Lecture Notes in Mathematics, Vol. 1097 (1984) pp. 2–142.

[30] R.M. Dudley, E. Gine and J. Zinn, Uniform and universal Glivenko–Cantelli classes, J. Theoret. Probab. 4 (1991) 485–510.

[31] N. Dyn, Interpolation and approximation by radial and related functions, in: *Approximation Theory VI*, eds. C.K. Chui, L.L. Schumaker and D.J. Ward (Academic Press, New York, 1991) pp. 211–234.

[32] N. Dyn, I.R.H. Jackson, D. Levin and A. Ron, On multivariate approximation by integer translates of a basis function, Computer Sciences Technical Report 886, University of Wisconsin-Madison (November 1989).

[33] N. Dyn, D. Levin and S. Rippa, Numerical procedures for surface fitting of scattered data by radial functions, SIAM J. Sci. Statist. Comput. 7(2) (1986) 639–659.

[34] T. Evgeniou, L. Perez-Breva, M. Pontil and T. Poggio, Bounds on the generalization performance of kernel machine ensembles, A.I. Memo, MIT Artificial Intelligence Lab. (1999).

[35] T. Evgeniou and M. Pontil, A note on the generalization performance of kernel classifiers with margin, A.I. Memo, MIT Artificial Intelligence Lab. (1999).

[36] T. Evgeniou and M. Pontil, On the V-gamma dimension for regression in reproducing kernel Hilbert spaces, A.I. Memo, MIT Artificial Intelligence Lab. (1999).

[37] F. Girosi, Models of noise and robust estimates, A.I. Memo 1287, Artificial Intelligence Laboratory, Massachusetts Institute of Technology (1991).

[38] F. Girosi, An equivalence between sparse approximation and Support Vector Machines, Neural Comput. 10(6) (1998) 1455–1480.

[39] F. Girosi, M. Jones and T. Poggio, Regularization theory and neural networks architectures, Neural Comput. 7 (1995) 219–269.

[40] F. Girosi, T. Poggio and B. Caprile, Extensions of a theory of networks for approximation and learning: outliers and negative examples, in: *Advances in Neural Information Processing Systems*, Vol. 3, eds. R. Lippmann, J. Moody and D. Touretzky (Morgan Kaufmann, San Mateo, CA, 1991).

[41] W. Härdle, *Applied Nonparametric Regression*, Econometric Society Monographs, Vol. 19 (Cambridge University Press, 1990).

[42] G.F. Harpur and R.W. Prager, Development of low entropy coding in a recurrent network, Network 7 (1996) 277–284.

[43] T. Hastie and R. Tibshirani, *Generalized Additive Models*, Monographs on Statistics and Applied Probability, Vol. 43 (Chapman and Hall, London, 1990).

[44] S. Haykin, *Neural Networks: A Comprehensive Foundation* (Macmillan, New York, 1994).

[45] H. Hochstadt, *Integral Equations*, Wiley Classics Library (Wiley, New York, 1973).

[46] V.V. Ivanov, *The Theory of Approximate Methods and Their Application to the Numerical Solution of Singular Integral Equations* (Noordhoff International, Leiden, 1976).

[47] T. Jaakkola and D. Haussler, Probabilistic kernel regression models, in: *Proc. of Neural Information Processing Conference* (1998).

[48] I.R.H. Jackson, Radial basis function methods for multivariate approximation, Ph.D. thesis, University of Cambridge, UK (1988).

[49] M. Kearns, Y. Mansour, A. Ng and D. Ron, An experimental and theoretical comparison of model selection methods, in: *Proceedings of the 8th Annual ACM Conference on Computational Learning Theory* (1995).

[50] M. Kearns and R.E. Schapire, Efficient distribution-free learning of probabilistic concepts, J. Comput. Syst. Sci. 48(3) (1994) 464–497.

[51] G.S. Kimeldorf and G. Wahba, A correspondence between Bayesian estimation on stochastic processes and smoothing by splines, Ann. Math. Statist. 41(2) (1971) 495–502.

[52] T.-W. Lee, M. Girolami, A.J. Bell and T. Sejnowski, A unifying information-theoretical framework for independent component analysis, Internat. J. Math. Comput. Mod. (1998) to appear.

[53] M. Lewicki and T. Sejnowski, Learning nonlinear overcomplete representations for efficient coding, in: *Advances in Neural Information Processing Systems* (1997).

[54] G.G. Lorentz, *Approximation of Functions* (Chelsea, New York, 1986).

[55] D.J.C. MacKay, Introduction to Gaussian processes (1997) (available at the URL: http://wol.ra.phy.cam.ac.uk/mackay).

[56] W.R. Madych and S.A. Nelson, Polyharmonic cardinal splines: a minimization property, J. Approx. Theory 63 (1990) 303–320.

[57] S. Mallat and Z. Zhang, Matching pursuit in a time-frequency dictionary, IEEE Trans. Signal Process. 41 (1993) 3397–3415.

[58] J.L. Marroquin, S. Mitter and T. Poggio, Probabilistic solution of ill-posed problems in computational vision, J. Amer. Statist. Assoc. 82 (1987) 76–89.

[59] H.N. Mhaskar, Neural networks for localized approximation of real functions, in: *Neural Networks for Signal Processing III, Proceedings of the 1993 IEEE-SP Workshop*, eds. C.A. Kamm et al. (IEEE Signal Processing Society, New York, 1993) pp. 190–196.

[60] C.A. Micchelli, Interpolation of scattered data: distance matrices and conditionally positive definite functions, Constr. Approx. 2 (1986) 11–22.

[61] P. Niyogi and F. Girosi, On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions, Neural Comput. 8 (1996) 819–842.

[62] P. Niyogi, F. Girosi and T. Poggio, Incorporating prior information in machine learning by creating virtual examples, Proc. IEEE 86(11) (1998) 2196–2209.

[63] E. Oja, The nonlinear PCA learning rule in independent component analysis, Neurocomput. 17 (1997) 25–45.

[64] B. Olshausen, Learning linear, sparse, factorial codes, A.I. Memo 1580, MIT Artificial Intelligence Lab. (1996).

[65] B.A. Olshausen and D.J. Field, Emergence of simple-cell receptive field properties by learning a sparse code for natural images, Nature 381 (1996) 607–609.

[66] C. Papageorgiou, F. Girosi and T. Poggio, Sparse correlation kernel based signal reconstruction, Technical Report 1635, Artificial Intelligence Laboratory, Massachusetts Institute of Technology (1998). (CBCL Memo 162.)

[67] G. Parisi, *Statistical Field Theory* (Addison-Wesley, Reading, MA, 1988).

[68] T. Poggio, On optimal nonlinear associative recall, Biological Cybernetics 19 (1975) 201–209.

[69] T. Poggio and F. Girosi, A theory of networks for approximation and learning, A.I. Memo No. 1140, Artificial Intelligence Laboratory, Massachusetts Institute of Technology (1989).

[70] T. Poggio and F. Girosi, Networks for approximation and learning, Proc. IEEE 78(9) (1990).

[71] T. Poggio and F. Girosi, Networks for approximation and learning, in: *Foundations of Neural Networks*, ed. C. Lau (IEEE Press, Piscataway, NJ, 1992) pp. 91–106.

[72] T. Poggio and F. Girosi, A sparse representation for function approximation, Neural Comput. 10(6) (1998).

[73] T. Poggio, V. Torre and C. Koch, Computational vision and regularization theory, Nature 317 (1985) 314–319.

[74] D. Pollard, *Convergence of Stochastic Processes* (Springer, Berlin, 1984).

[75] M. Pontil, S. Mukherjee and F. Girosi, On the noise model of support vector machine regression, A.I. Memo, MIT Artificial Intelligence Laboratory (1998) (in preparation).

[76] M. Pontil, R. Rifkin and T. Evgeniou, From regression to classification in support vector machines, A.I. Memo 1649, MIT Artificial Intelligence Lab. (1998).

[77] M.J.D. Powell, The theory of radial basis function approximation in 1990, in: *Advances in Numerical Analysis Volume II: Wavelets, Subdivision Algorithms and Radial Basis Functions*, ed. W.A. Light (Oxford University Press, 1992) pp. 105–210.

[78] C. Rabut, How to build quasi-interpolants: applications to polyharmonic B-splines, in: *Curves and Surfaces*, eds. P.-J. Laurent, A. Le Méhauté and L.L. Schumaker (Academic Press, New York, 1991) pp. 391–402.

[79] C. Rabut, An introduction to Schoenberg's approximation, Comput. Math. Appl. 24(12) (1992) 149–175.

[80] R. Rifkin, M. Pontil and A. Verri, A note on support vector machine degeneracy, A.I. Memo, MIT Artificial Intelligence Lab. (1999).

[81] J. Rissanen, Modeling by shortest data description, Automatica 14 (1978) 465–471.

[82] I.J. Schoenberg, Contributions to the problem of approximation of equidistant data by analytic functions, Part A: On the problem of smoothing of graduation, a first class of analytic approximation formulae, Quart. Appl. Math. 4 (1946) 45–99.

[83] I.J. Schoenberg, Cardinal interpolation and spline functions, J. Approx. Theory 2 (1969) 167–206.

[84] B. Schölkopf, P. Simard, A. Smola and V. Vapnik, Prior knowledge in support vector kernels, in: *Advances in Neural Information Processing Systems 9* (1997).

[85] L.L. Schumaker, *Spline Functions: Basic Theory* (Wiley, New York, 1981).

[86] J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson and M. Anthony, Structural risk minimization over data-dependent hierarchies, IEEE Trans. Inform. Theory (1998) to appear. Also: NeuroCOLT Technical Report NC-TR-96-053 (1996) ftp://ftp.dcs.rhbnc.ac.uk/pub/neurocolt/tech reports.

[87] J. Shawe-Taylor and N. Cristianini, Robust bounds on generalization from the margin distribution, NeuroCOLT2 Technical Report NC2-TR-1998-029, NeuroCOLT2 (1998).

[88] B.W. Silverman, Spline smoothing: the equivalent variable kernel method, Ann. Statist. 12 (1984) 898–916.

[89] P. Simard, Y. LeCun and J. Denker, Efficient pattern recognition using a new transformation distance, in: *Advances in Neural Information Processing Systems 5* (1993) pp. 50–58.

[90] A. Smola and B. Schölkopf, On a kernel-based method for pattern recognition, regression, approximation and operator inversion, Algorithmica 22 (1998) 211–231.

[91] J. Stewart, Positive definite functions and generalizations, an historical survey, Rocky Mountain J. Math. 6 (1976) 409–434.

[92] A.N. Tikhonov and V.Y. Arsenin, *Solutions of Ill-posed Problems* (W.H. Winston, Washington, DC, 1977).

[93] L.G. Valiant, A theory of the learnable, in: *Proc. of the 1984 STOC* (1984) pp. 436–445.

[94] V.N. Vapnik, *Estimation of Dependences Based on Empirical Data* (Springer, Berlin, 1982).

[95] V.N. Vapnik, *The Nature of Statistical Learning Theory* (Springer, New York, 1995).

[96] V.N. Vapnik, *Statistical Learning Theory* (Wiley, New York, 1998).

[97] V.N. Vapnik and A.Y. Chervonenkis, On the uniform convergence of relative frequencies of events to their probabilities, Theory Probab. Appl. 17(2) (1971) 264–280.

[98] V.N. Vapnik and A.Ya. Chervonenkis, The necessary and sufficient conditions for the uniform convergence of averages to their expected values, Teor. Veroyatn. i Primenen. 26(3) (1981) 543–564.

[99] V.N. Vapnik and A.Ya. Chervonenkis, The necessary and sufficient conditions for consistency in the empirical risk minimization method, Pattern Recognition and Image Analysis 1(3) (1991) 283–305.

[100] G. Wahba, Spline bases, regularization, and generalized cross-validation for solving approximation problems with large quantities of noisy data, in: *Proceedings of the International Conference on Approximation Theory in Honour of George Lorentz*, eds. J. Ward and E. Cheney, Austin, TX, January 8–10, 1980 (Academic Press, 1980).

[101] G. Wahba, A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem, Ann. Statist. 13 (1985) 1378–1402.

[102] G. Wahba, *Spline Models for Observational Data*, Series in Applied Mathematics, Vol. 59 (SIAM, Philadelphia, PA, 1990).

[103] R. Williamson, A. Smola and B. Schölkopf, Generalization performance of regularization networks and support vector machines via entropy numbers, Technical Report NC-TR-98-019, Royal Holloway College, University of London (1998).