Intrinsic Geometries in Learning

  • Richard Nock
  • Frank Nielsen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5416)


In a seminal paper, Amari (1998) proved that learning can be made more efficient when one uses the intrinsic Riemannian structure of the algorithms' parameter spaces to point the gradient towards better solutions. In this paper, we show that many learning algorithms, including various boosting algorithms for linear separators, the most popular top-down decision-tree induction algorithms, and some on-line learning algorithms, are spawns of a generalization of Amari's natural gradient to particular non-Riemannian spaces. These algorithms exploit an intrinsic dual geometric structure of the parameter space, in relationship with particular integral losses that are to be minimized. We unify some of them, such as AdaBoost, additive regression with the square loss, the logistic loss, and the top-down induction performed in CART and C4.5, as a single algorithm for which we show general convergence to the optimum and explicit convergence rates under very weak assumptions. As a consequence, many of the classification-calibrated surrogates of Bartlett et al. (2006) admit efficient minimization algorithms.
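Amari's natural-gradient idea, which the abstract generalizes, can be sketched numerically. The example below is a minimal illustration, not the paper's algorithm: it fits a logistic-regression model by preconditioning the Euclidean gradient with the Fisher information matrix, which plays the role of the Riemannian metric on the parameter space. The data, step size, and damping constant are all made up for the demonstration.

```python
import numpy as np

# Synthetic binary classification data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = (rng.random(200) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w):
    # Logistic (log) loss, one of the integral losses the paper considers.
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

w = np.zeros(3)
for _ in range(100):
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) / len(y)                 # Euclidean gradient
    var = p * (1 - p)                             # per-example Bernoulli variances
    # Fisher information matrix = expected outer product of score vectors;
    # a small damping term keeps it invertible.
    F = (X * var[:, None]).T @ X / len(y) + 1e-6 * np.eye(3)
    w -= np.linalg.solve(F, grad)                 # natural-gradient step
```

For this exponential-family model the natural-gradient step coincides with a Newton step, which is why it converges in far fewer iterations than vanilla gradient descent with the same loss.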


Keywords: Exponential Family, Weak Learner, Empirical Risk, Linear Separator, Hinge Loss




References

  1. Amari, S.-I.: Natural gradient works efficiently in learning. Neural Computation 10, 251–276 (1998)
  2. Azran, A., Meir, R.: Data dependent risk bounds for hierarchical mixture of experts classifiers. In: Shawe-Taylor, J., Singer, Y. (eds.) COLT 2004. LNCS, vol. 3120, pp. 427–441. Springer, Heidelberg (2004)
  3. Banerjee, A., Guo, X., Wang, H.: On the optimality of conditional expectation as a Bregman predictor. IEEE Trans. on Information Theory 51, 2664–2669 (2005)
  4. Banerjee, A., Merugu, S., Dhillon, I., Ghosh, J.: Clustering with Bregman divergences. Journal of Machine Learning Research 6, 1705–1749 (2005)
  5. Bartlett, P., Jordan, M., McAuliffe, J.D.: Convexity, classification, and risk bounds. Journal of the Am. Stat. Assoc. 101, 138–156 (2006)
  6. Bartlett, P., Traskin, M.: AdaBoost is consistent. In: NIPS*19 (2006)
  7. Blake, C.L., Keogh, E., Merz, C.J.: UCI repository of machine learning databases (1998)
  8. Bregman, L.M.: The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comp. Math. and Math. Phys. 7, 200–217 (1967)
  9. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Wadsworth (1984)
  10. Collins, M., Schapire, R., Singer, Y.: Logistic regression, AdaBoost and Bregman distances. In: COLT 2000, pp. 158–169 (2000)
  11. Davis, J., Kulis, B., Jain, P., Sra, S., Dhillon, I.: Information-theoretic metric learning. In: ICML 2007 (2007)
  12. Dhillon, I., Sra, S.: Generalized non-negative matrix approximations with Bregman divergences. In: Advances in Neural Information Processing Systems, vol. 18 (2005)
  13. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Comp. Syst. Sci. 55, 119–139 (1997)
  14. Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of boosting. Ann. of Stat. 28, 337–374 (2000)
  15. Gates, G.W.: The reduced nearest neighbor rule. IEEE Trans. on Information Theory 18, 431–433 (1972)
  16. Gentile, C., Warmuth, M.: Linear hinge loss and average margin. In: NIPS*11, pp. 225–231 (1998)
  17. Gentile, C., Warmuth, M.: Proving relative loss bounds for on-line learning algorithms using Bregman divergences. In: Tutorials of the 13th International Conference on Computational Learning Theory (2000)
  18. Grünwald, P., Dawid, P.: Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. Ann. of Statistics 32, 1367–1433 (2004)
  19. Henry, C., Nock, R., Nielsen, F.: ℝeal boosting à la carte with an application to boosting Oblique Decision Trees. In: Proc. of the 21st International Joint Conference on Artificial Intelligence, pp. 842–847 (2007)
  20. Herbster, M., Warmuth, M.: Tracking the best regressor. In: COLT 1998, pp. 24–31 (1998)
  21. Kearns, M.J., Vazirani, U.V.: An Introduction to Computational Learning Theory. MIT Press, Cambridge (1994)
  22. Kearns, M.J.: Thoughts on hypothesis boosting, ML class project (1988)
  23. Kearns, M.J., Mansour, Y.: On the boosting ability of top-down decision tree learning algorithms. Journal of Comp. Syst. Sci. 58, 109–128 (1999)
  24. Kearns, M.J., Valiant, L.: Cryptographic limitations on learning boolean formulae and finite automata. In: Proc. of the 21st ACM Symposium on the Theory of Computing, pp. 433–444 (1989)
  25. Kivinen, J., Warmuth, M., Hassibi, B.: The p-norm generalization of the LMS algorithm for adaptive filtering. IEEE Trans. on Signal Processing 54, 1782–1793 (2006)
  26. Kohavi, R.: The power of decision tables. In: Proc. of the 10th European Conference on Machine Learning, pp. 174–189 (1995)
  27. Matsushita, K.: Decision rule, based on distance, for the classification problem. Ann. of the Inst. for Stat. Math. 8, 67–77 (1956)
  28. Mitchell, T.M.: The need for biases in learning generalizations. Technical Report CBM-TR-117, Rutgers University (1980)
  29. Murata, N., Takenouchi, T., Kanamori, T., Eguchi, S.: Information geometry of \(\mathcal{U}\)-Boost and Bregman divergence. Neural Computation, 1437–1481 (2004)
  30. Nielsen, F., Boissonnat, J.-D., Nock, R.: On Bregman Voronoi diagrams. In: Proc. of the 19th ACM-SIAM Symposium on Discrete Algorithms, pp. 746–755 (2007)
  31. Nielsen, F., Boissonnat, J.-D., Nock, R.: Bregman Voronoi diagrams: properties, algorithms and applications, 45 p. (submitted, 2008)
  32. Nock, R.: Inducing interpretable voting classifiers without trading accuracy for simplicity: theoretical results, approximation algorithms, and experiments. Journal of Artificial Intelligence Research 17, 137–170 (2002)
  33. Nock, R., Nielsen, F.: A ℝeal generalization of discrete AdaBoost. Artif. Intell. 171, 25–41 (2007)
  34. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco (1993)
  35. Schapire, R.E., Freund, Y., Bartlett, P., Lee, W.S.: Boosting the margin: a new explanation for the effectiveness of voting methods. Annals of Statistics 26, 1651–1686 (1998)
  36. Schapire, R.E., Singer, Y.: Improved boosting algorithms using confidence-rated predictions. Machine Learning Journal 37, 297–336 (1999)
  37. Valiant, L.G.: A theory of the learnable. Communications of the ACM 27, 1134–1142 (1984)
  38. Vapnik, V.: Statistical Learning Theory. John Wiley, Chichester (1998)
  39. Warmuth, M., Liao, J., Rätsch, G.: Totally corrective boosting algorithms that maximize the margin. In: ICML 2006, pp. 1001–1008 (2006)

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Richard Nock (1)
  • Frank Nielsen (2, 3)

  1. CEREGMIA, Université Antilles-Guyane, Schoelcher, France
  2. LIX, Ecole Polytechnique, Palaiseau, France
  3. Sony Computer Science Laboratories Inc., Tokyo, Japan
