Fundamentals of Machine Learning

Chapter in Neural Networks and Statistical Learning

Abstract

Learning is a fundamental capability of neural networks. Learning rules are algorithms for finding suitable weights W and/or other network parameters.
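
To make the idea concrete, here is a minimal sketch (not taken from the chapter) of one of the simplest learning rules: gradient descent on the mean squared error of a linear model, which iteratively adjusts the weight matrix W toward suitable values. The function name train_linear, the learning rate eta, and the synthetic data are hypothetical choices for illustration only.

    import numpy as np

    def train_linear(X, T, eta=0.01, epochs=1000):
        """Gradient-descent learning rule: find weights W so that X @ W approximates T."""
        n_inputs, n_outputs = X.shape[1], T.shape[1]
        W = np.zeros((n_inputs, n_outputs))
        for _ in range(epochs):
            Y = X @ W                      # forward pass of the linear model
            grad = X.T @ (Y - T) / len(X)  # gradient of the mean squared error w.r.t. W
            W -= eta * grad                # weight update (the learning rule)
        return W

    # Usage: recover the weights of a noisy linear mapping.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    T = X @ np.array([[2.0], [-1.0], [0.5]]) + 0.01 * rng.normal(size=(200, 1))
    print(train_linear(X, T).round(2))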

Author information

Corresponding author

Correspondence to Ke-Lin Du.

Copyright information

© 2014 Springer-Verlag London

About this chapter

Cite this chapter

Du, K.-L., & Swamy, M. N. S. (2014). Fundamentals of Machine Learning. In: Neural Networks and Statistical Learning. Springer, London. https://doi.org/10.1007/978-1-4471-5571-3_2

  • DOI: https://doi.org/10.1007/978-1-4471-5571-3_2

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-4471-5570-6

  • Online ISBN: 978-1-4471-5571-3

  • eBook Packages: Engineering, Engineering (R0)
