Leaving the Span

  • Manfred K. Warmuth
  • S. V. N. Vishwanathan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3559)


We discuss a simple sparse linear problem that is hard to learn with any algorithm that uses a linear combination of the training instances as its weight vector. The hardness holds even if we allow the learner to embed the instances into any higher dimensional feature space (and use a kernel function to define the dot product between the embedded instances). These algorithms are inherently limited by the fact that after seeing k instances only a weight space of dimension k can be spanned.
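The span limitation above can be illustrated with a small numerical sketch (our own illustration, not from the paper): whatever coefficients such a learner picks, its weight vector stays in the subspace spanned by the k (embedded) training instances, so the reachable weight space has rank at most k even in a much higher-dimensional feature space.

```python
import numpy as np

# Sketch of the span limitation (illustrative; dimensions are arbitrary).
# A learner that outputs w = sum_i alpha_i * x_i cannot leave span(X).
rng = np.random.default_rng(0)
k, n = 3, 10                       # k training instances in R^n, n >> k
X = rng.standard_normal((k, n))    # rows are the training instances
alpha = rng.standard_normal(k)     # arbitrary coefficients the learner picks
w = alpha @ X                      # weight vector is forced into span(X)

# Stacking w on top of X does not increase the rank beyond k.
rank = np.linalg.matrix_rank(np.vstack([X, w]))
```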

Our hardness result is surprising because the same problem can be learned efficiently using the exponentiated gradient (EG) algorithm: now the component-wise logarithms of the weights, rather than the weights themselves, are essentially a linear combination of the training instances. This algorithm enforces additional constraints on the weights (all must be non-negative and sum to one), and in some cases these constraints alone force the rank of the weight space to grow as fast as 2^k.
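A minimal sketch of the EG update (our own illustration, with an arbitrary learning rate and gradient): each weight is scaled multiplicatively by the exponentiated negative gradient and then renormalized, so the weights remain non-negative and sum to one.

```python
import numpy as np

def eg_update(w, grad, eta=0.1):
    """One exponentiated-gradient step: multiplicative scaling, then
    renormalization back onto the probability simplex."""
    v = w * np.exp(-eta * grad)
    return v / v.sum()

w = np.ones(4) / 4.0                       # start at the uniform distribution
grad = np.array([1.0, 0.0, 0.0, 0.0])      # toy gradient penalizing w[0]
w = eg_update(w, grad, eta=1.0)
# the penalized coordinate shrinks relative to the others, and the
# simplex constraints (non-negative, summing to one) are preserved
```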





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Manfred K. Warmuth (1)
  • S. V. N. Vishwanathan (2)
  1. Computer Science Department, University of California, Santa Cruz, U.S.A.
  2. Machine Learning Program, National ICT Australia, Canberra, Australia
