Machine Learning, Volume 107, Issue 12, pp 1923–1945

Wasserstein discriminant analysis

  • Rémi Flamary
  • Marco Cuturi
  • Nicolas Courty
  • Alain Rakotomamonjy


Wasserstein discriminant analysis (WDA) is a new supervised linear dimensionality reduction algorithm. Following the blueprint of classical Fisher Discriminant Analysis, WDA selects the projection matrix that maximizes the ratio of the dispersion of projected points belonging to different classes to the dispersion of projected points belonging to the same class. To quantify dispersion, WDA uses regularized Wasserstein distances. Thanks to the underlying principles of optimal transport, WDA is able to capture both global (at the distribution scale) and local (at the sample scale) interactions between classes. In addition, we show that WDA leverages a mechanism that induces neighborhood preservation. Regularized Wasserstein distances can be computed using the Sinkhorn matrix scaling algorithm, and the optimization problem of WDA can be tackled using automatic differentiation of Sinkhorn’s fixed-point iterations. Numerical experiments show promising results, both in terms of prediction and of visualization, on toy examples and on real datasets such as MNIST and deep features extracted from a subset of the Caltech dataset.
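The objective described above can be illustrated with a minimal sketch in NumPy. The function names, the uniform sample weights, and the regularization default below are illustrative assumptions, not the authors' implementation; in particular, the sketch only evaluates the WDA ratio and omits the Stiefel-manifold optimization and the automatic differentiation through Sinkhorn iterations used in the paper.

```python
import numpy as np

def sinkhorn_dispersion(X, Y, P, reg=1.0, n_iter=100):
    """Entropy-regularized Wasserstein dispersion between the projections
    of two point clouds X (n x d) and Y (m x d) under a projection P (p x d).
    Illustrative helper: uniform weights, fixed number of Sinkhorn iterations."""
    Xp, Yp = X @ P.T, Y @ P.T                              # project to p dims
    # squared Euclidean cost matrix in the projected space
    C = ((Xp[:, None, :] - Yp[None, :, :]) ** 2).sum(-1)
    K = np.exp(-C / reg)                                   # Gibbs kernel
    a = np.full(len(X), 1.0 / len(X))                      # uniform marginals
    b = np.full(len(Y), 1.0 / len(Y))
    u = np.ones_like(a)
    for _ in range(n_iter):                                # Sinkhorn scaling
        v = b / (K.T @ u)
        u = a / (K @ v)
    T = u[:, None] * K * v[None, :]                        # regularized coupling
    return (T * C).sum()                                   # transport cost

def wda_objective(P, classes, reg=1.0):
    """Ratio of between-class to within-class regularized Wasserstein
    dispersions; `classes` is a list of per-class sample arrays."""
    between = sum(sinkhorn_dispersion(Xi, Xj, P, reg)
                  for i, Xi in enumerate(classes)
                  for Xj in classes[i + 1:])
    within = sum(sinkhorn_dispersion(Xi, Xi, P, reg) for Xi in classes)
    return between / max(within, 1e-12)
```

On a toy two-class problem separated along one coordinate, a projection onto the separating axis yields a larger ratio than a projection onto a noise axis, which is exactly the quantity WDA maximizes over projection matrices.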


Keywords: Linear discriminant analysis · Optimal transport · Wasserstein distance



Copyright information

© The Author(s) 2018

Authors and Affiliations

  1. Lagrange, Observatoire de la Côte d’Azur, Université Côte d’Azur, Nice, France
  2. CREST, ENSAE, Campus Paris-Saclay, Palaiseau, France
  3. Laboratoire IRISA, Campus de Tohannic, Vannes, France
  4. LITIS EA4108, Université Rouen Normandie, Saint-Etienne du Rouvray, France
