Fundamentals of Machine Learning

  • Ke-Lin Du
  • M. N. S. Swamy
Chapter

Abstract

Learning is a fundamental capability of neural networks. Learning rules are algorithms for finding suitable weights W, and possibly other network parameters, from training data.
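
As a concrete illustration of such a learning rule, the sketch below implements the classical least-mean-squares (LMS) delta rule, which adjusts each weight in proportion to the prediction error. It is a generic example assuming a linear model, not an algorithm taken from this chapter.

```python
# Minimal sketch of one classical learning rule (the LMS / delta rule),
# offered as a generic illustration and not as this chapter's algorithm.
import numpy as np

def lms_fit(X, y, lr=0.01, epochs=100):
    """Learn weights w for a linear model y ~ X @ w by stochastic
    gradient descent on the mean-square error."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        for i in range(n_samples):
            error = y[i] - X[i] @ w   # prediction error on one sample
            w += lr * error * X[i]    # delta-rule update: step along -gradient
    return w

# Usage: recover the weights of a noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.01 * rng.normal(size=200)
print(lms_fit(X, y))  # roughly [1.5, -2.0, 0.5]
```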

Keywords

Mean Square Error, Boolean Function, Generalization Error, Orthogonal Matching Pursuit, Restricted Isometry Property
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Copyright information

© Springer-Verlag London 2014

Authors and Affiliations

  1. Enjoyor Labs, Enjoyor Inc., Hangzhou, China
  2. Department of Electrical and Computer Engineering, Concordia University, Montreal, Canada
