Wasserstein discriminant analysis
Wasserstein discriminant analysis (WDA) is a new supervised linear dimensionality reduction algorithm. Following the blueprint of classical Fisher discriminant analysis, WDA selects the projection matrix that maximizes the ratio between the dispersion of projected points from different classes and the dispersion of projected points from the same class. To quantify dispersion, WDA uses regularized Wasserstein distances. Thanks to the underlying principles of optimal transport, WDA is able to capture both global (at the distribution scale) and local (at the sample scale) interactions between classes. In addition, we show that WDA leverages a mechanism that induces neighborhood preservation. Regularized Wasserstein distances can be computed with the Sinkhorn matrix scaling algorithm, and the WDA optimization problem can be tackled by automatic differentiation of Sinkhorn’s fixed-point iterations. Numerical experiments show promising results, both in terms of prediction and of visualization, on toy examples, on real datasets such as MNIST, and on deep features obtained from a subset of the Caltech dataset.
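To make the criterion concrete, here is a sketch of the objective implied by the abstract, in notation the abstract itself does not fix (P is the d×p orthonormal projection matrix, X^c the samples of class c, and W_λ the entropy-regularized Wasserstein distance):

$$
\max_{P^\top P = I_p} \;
\frac{\sum_{c}\sum_{c' > c} W_{\lambda}\!\left(P X^{c},\, P X^{c'}\right)}
     {\sum_{c} W_{\lambda}\!\left(P X^{c},\, P X^{c}\right)}
$$

The numerator spreads classes apart (between-class dispersion) while the denominator contracts each class onto itself (within-class dispersion), exactly the Fisher-style ratio described above.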
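A minimal NumPy sketch of the Sinkhorn fixed-point iterations mentioned above, assuming uniform sample weights and a precomputed cost matrix; this is an illustration rather than the authors' implementation (the POT library cited below provides a production version as `ot.sinkhorn`):

```python
import numpy as np

def sinkhorn(a, b, M, reg, n_iter=1000):
    """Entropy-regularized OT cost via Sinkhorn matrix scaling.

    a, b : histograms (nonnegative weights summing to 1)
    M    : pairwise ground-cost matrix, shape (len(a), len(b))
    reg  : regularization strength (the lambda of the abstract)
    """
    K = np.exp(-M / reg)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):           # the fixed-point iterations
        v = b / (K.T @ u)             # scale columns to match b
        u = a / (K @ v)               # scale rows to match a
    T = u[:, None] * K * v[None, :]   # regularized transport plan
    return np.sum(T * M)              # transport cost <T, M>

# Toy usage: regularized distance between two point clouds
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
Y = rng.normal(size=(7, 2)) + 2.0
M = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # squared Euclidean costs
a = np.full(5, 1 / 5)
b = np.full(7, 1 / 7)
print(sinkhorn(a, b, M, reg=1.0))
```

Because the loop is a differentiable composition of elementary operations, differentiating this routine with respect to M (and hence with respect to P, when M is built from squared distances between projected samples) is what the abstract refers to as automatic differentiation of Sinkhorn’s fixed-point iterations.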
Keywords: Linear discriminant analysis · Optimal transport · Wasserstein distance
- Bach, F. R., Lanckriet, G. R., & Jordan, M. I. (2004). Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the twenty-first international conference on machine learning (p. 6). ACM.
- Courty, N., Flamary, R., Tuia, D., & Rakotomamonjy, A. (2016). Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Cuturi, M. (2013). Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS (pp. 2292–2300).
- Cuturi, M., & Doucet, A. (2014). Fast computation of Wasserstein barycenters. In ICML.
- Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., et al. (2014). DeCAF: A deep convolutional activation feature for generic visual recognition. In Proceedings of the 31st international conference on machine learning (pp. 647–655).
- Emigh, M., Kriminger, E., & Príncipe, J. C. (2015). Linear discriminant analysis with an information divergence criterion. In 2015 international joint conference on neural networks (IJCNN) (pp. 1–6). IEEE.
- Fern, X. Z., & Brodley, C. E. (2003). Random projection for high dimensional data clustering: A cluster ensemble approach. In ICML (Vol. 3, pp. 186–193).
- Flamary, R., & Courty, N. (2017). POT: Python Optimal Transport library.
- Frogner, C., Zhang, C., Mobahi, H., Araya, M., & Poggio, T. (2015). Learning with a Wasserstein loss. In NIPS (pp. 2044–2052).
- Giraldo, L. G. S., & Príncipe, J. C. (2013). Information theoretic learning with infinitely divisible kernels. In Proceedings of the first international conference on learning representations (ICLR) (pp. 1–8).
- Griffin, G., Holub, A., & Perona, P. (2007). Caltech-256 object category dataset. Technical report CNS-TR-2007-001, California Institute of Technology.
- Huang, G., Guo, C., Kusner, M. J., Sun, Y., Sha, F., & Weinberger, K. Q. (2016). Supervised word mover’s distance. In Advances in neural information processing systems (pp. 4862–4870).
- Lichman, M. (2013). UCI machine learning repository. http://archive.ics.uci.edu/ml.
- Mueller, J., & Jaakkola, T. (2015). Principal differences analysis: Interpretable characterization of differences between distributions. In NIPS (pp. 1693–1701).
- Petersen, K. B., & Pedersen, M. S. (2008). The matrix cookbook. Technical University of Denmark.
- Peyré, G., & Cuturi, M. (2018). Computational optimal transport. Foundations and Trends in Machine Learning (to be published). https://optimaltransport.github.io.
- Schmidt, M. (2008). minConf: Projection methods for optimization with simple constraints in Matlab.
- Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge: MIT Press.
- Seguy, V., & Cuturi, M. (2015). Principal geodesic analysis for probability measures under the optimal transport metric. In NIPS (pp. 3294–3302).
- Solomon, J., Rustamov, R., Guibas, L., & Butscher, A. (2014). Wasserstein propagation for semi-supervised learning. In ICML (pp. 306–314).
- Tangkaratt, V., Sasaki, H., & Sugiyama, M. (2015). Direct estimation of the derivative of quadratic mutual information with application in supervised dimension reduction. arXiv preprint arXiv:1508.01019.
- Van der Maaten, L., Postma, E., & Van den Herik, J. (2009). Dimensionality reduction: A comparative review. Journal of Machine Learning Research, 10, 66–71.
- Xing, E. P., Ng, A. Y., Jordan, M. I., & Russell, S. (2003). Distance metric learning with application to clustering with side-information. Advances in Neural Information Processing Systems, 15, 505–512.