Abstract
Wasserstein discriminant analysis (WDA) is a new supervised linear dimensionality reduction algorithm. Following the blueprint of classical Fisher Discriminant Analysis, WDA selects the projection matrix that maximizes the ratio between the dispersion of projected points belonging to different classes and the dispersion of projected points belonging to the same class. To quantify dispersion, WDA uses regularized Wasserstein distances. Thanks to the underlying principles of optimal transport, WDA is able to capture both global (at distribution scale) and local (at sample scale) interactions between classes. In addition, we show that WDA leverages a mechanism that induces neighborhood preservation. Regularized Wasserstein distances can be computed using the Sinkhorn matrix scaling algorithm; the optimization problem of WDA can be tackled using automatic differentiation of Sinkhorn's fixed-point iterations. Numerical experiments show promising results, both in terms of prediction and visualization, on toy examples and real datasets such as MNIST, and on deep features obtained from a subset of the Caltech dataset.
Keywords
Linear discriminant analysis · Optimal transport · Wasserstein distance
1 Introduction
Feature learning is a crucial component in many applications of machine learning. New feature extraction methods or data representations are often responsible for breakthroughs in performance, as illustrated by the kernel trick in support vector machines (Schölkopf and Smola 2002) and their feature learning counterpart in multiple kernel learning (Bach et al. 2004), and more recently by deep architectures (Bengio 2009).
Among all the feature extraction approaches, one major family of dimensionality reduction methods (Van Der Maaten et al. 2009; Burges 2010) consists in estimating a linear subspace of the data. Although very simple, linear subspaces have many advantages. They are easy to interpret, and can be inverted, at least in a least-squares sense. This latter property has been used for instance in PCA denoising (Zhang et al. 2010). Linear projection is also a key component in random projection methods (Fern and Brodley 2003) and compressed sensing, and is often used as a first pre-processing step, such as the linear part of a neural network layer. Finally, linear projections only involve matrix products and therefore run efficiently on any type of hardware (CPU, GPU, DSP).
Linear dimensionality reduction techniques come in all flavors. Some of them, such as PCA, are inherently unsupervised; some can consider labeled data and fall in the supervised category. We consider in this paper linear and supervised techniques. Within that category, two families of methods stand out: Given a dataset of pairs of vectors and labels \(\{(\mathbf {x}_i,y_i)\}_i\), with \(\mathbf {x}_i \in \mathbb {R}^d\), the goal of Fisher Discriminant Analysis (FDA) and variants is to learn a linear map \(\mathbf {P}:\mathbb {R}^d \rightarrow \mathbb {R}^p\), \(p\ll d\), such that the embeddings of these points \(\mathbf {P}\mathbf {x}_i\) can be easily discriminated using linear classifiers. Mahalanobis metric learning (MML) follows the same approach, except that the quality of the embedding \(\mathbf {P}\) is judged by the ability of a k-nearest neighbor algorithm (not a linear classifier) to obtain good classification accuracy.
1.1 FDA and MML, in both global and local flavors
FDA attempts to maximize w.r.t. \(\mathbf {P}\) the sum of all distances \(\Vert \mathbf {P}\mathbf {x}_i-\mathbf {P}\mathbf {x}_{j'} \Vert \) between pairs of samples from different classes \(c,c'\) while minimizing the sum of all distances \(\Vert \mathbf {P}\mathbf {x}_i-\mathbf {P}\mathbf {x}_j \Vert \) between pairs of samples within the same class c (Friedman et al. 2001, §4.3). Because of this, it is well documented that the performance of FDA degrades when class distributions are multimodal. Several variants have been proposed to tackle this problem (Friedman et al. 2001, §12.4). For instance, a localized version of FDA was proposed by Sugiyama (2007), which boils down to discarding the computation for all pairs of points that are not neighbors.
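In its standard scatter-matrix form, equivalent to the pairwise-distance view above (notation is ours here, with \(\boldsymbol{\mu}_c\) the mean of class c and \(\boldsymbol{\mu}\) the global mean), FDA solves

```latex
\max_{\mathbf{P}\mathbf{P}^\top = \mathbf{I}} \;
\frac{\operatorname{Tr}\!\left(\mathbf{P}\,\mathbf{S}_b\,\mathbf{P}^\top\right)}
     {\operatorname{Tr}\!\left(\mathbf{P}\,\mathbf{S}_w\,\mathbf{P}^\top\right)},
\qquad
\mathbf{S}_b = \sum_c n_c (\boldsymbol{\mu}_c - \boldsymbol{\mu})(\boldsymbol{\mu}_c - \boldsymbol{\mu})^\top,
\quad
\mathbf{S}_w = \sum_c \sum_{i \in c} (\mathbf{x}_i - \boldsymbol{\mu}_c)(\mathbf{x}_i - \boldsymbol{\mu}_c)^\top .
```

Both scatter matrices aggregate all pairs of samples indiscriminately, which is precisely why FDA is a purely global criterion.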
1.2 WDA: global and local
When \(\lambda \) is small, we will show that WDA boils down to FDA. When \(\lambda \) is large, WDA tries to split apart the class distributions by maximizing their optimal transport distance. In that process, for a given example \(\mathbf {x}_i\) in one class, only a few components \(T_{i,j}\) will be activated, so that \(\mathbf {x}_i\) will be paired with only a few examples. Figure 1 illustrates how the pairing weights \(T_{i,j}\) are defined when comparing Wasserstein discriminant analysis (WDA, with different regularization strengths) with FDA (purely global) and Local-FDA (purely local) (Sugiyama 2007). Another strong feature brought by regularized Wasserstein distances is that the relations between samples (as given by the optimal transport matrix \(\mathbf {T}\)) are estimated in the projected space. This is an important difference from all previous local approaches, which estimate local relations in the original space and assume that these relations are unchanged after projection.
1.3 Paper outline
Section 2 provides background on regularized Wasserstein distances. The WDA criterion and its practical optimization are presented in Sect. 3. Section 4 discusses properties of WDA and related work. Numerical experiments are provided in Sect. 5. Section 6 concludes the paper and outlines perspectives.
2 Background on Wasserstein distances
Wasserstein distances, also known as earth mover distances, define a geometry over the space of probability measures using principles from optimal transport theory (Villani 2008). Recent computational advances (Cuturi 2013; Benamou et al. 2015) have made them scalable to dimensions relevant to machine learning applications.
2.1 Notations and definitions
2.2 Regularized Wasserstein distance
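The entropy-regularized distance discussed in this section can be computed with the Sinkhorn matrix scaling algorithm. A minimal sketch for two empirical clouds with uniform weights follows (function and variable names are ours, not the paper's; the POT toolbox provides a production implementation):

```python
import numpy as np

def sinkhorn(Xs, Xt, lam, n_iter=100):
    """Entropy-regularized OT between two empirical clouds with uniform weights.

    Returns the transport plan T and the regularized transport cost <T, M>.
    Illustrative sketch only; convergence tolerance and log-domain stabilization
    are omitted for clarity.
    """
    ns, nt = len(Xs), len(Xt)
    a, b = np.ones(ns) / ns, np.ones(nt) / nt
    # squared Euclidean cost matrix M_ij = ||xs_i - xt_j||^2
    M = ((Xs[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)
    K = np.exp(-lam * M)          # Gibbs kernel e^{-lam * M}
    u = a.copy()
    for _ in range(n_iter):       # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    T = u[:, None] * K * v[None, :]
    return T, (T * M).sum()
```

The plan has marginals a and b by construction of the scaling iterations, and the cost is simply the Frobenius inner product of the plan with the cost matrix.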
3 Wasserstein discriminant analysis
In this section we discuss optimization problem (1) and propose an efficient approach to compute the gradient of its objective.
3.1 Optimization problem
3.2 Automatic differentiation
A possible way to compute the Jacobian \(\partial \mathbf {T}^{c,c'}/\partial \mathbf {P}\) is to use the implicit function theorem as in hyperparameter estimation in ML (Bengio 2000; Chapelle et al. 2002). We detail that approach in the appendix but it requires inverting a very large matrix, and does not scale in practice. It also assumes that the exact optimal transport \(\mathbf {T}_\lambda \) is obtained at each iteration, which is clearly an approximation since we only have the computational budget for a finite, and usually small, number of Sinkhorn iterations.
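The alternative retained here is to treat the L unrolled Sinkhorn iterations as an ordinary, differentiable function of \(\mathbf {P}\) and differentiate through them. The sketch below (our own illustrative names, not the paper's code) builds that unrolled objective in plain NumPy and uses a finite-difference gradient as a simple stand-in for what an automatic-differentiation tool would compute analytically:

```python
import numpy as np

def wda_cross_term(P, X1, X2, lam=1.0, L=10):
    """Regularized OT cost between the projections P@x of two classes,
    with the plan T obtained from L unrolled Sinkhorn iterations.
    This is the kind of quantity differentiated w.r.t. P in WDA."""
    Y1, Y2 = X1 @ P.T, X2 @ P.T
    M = ((Y1[:, None, :] - Y2[None, :, :]) ** 2).sum(-1)
    K = np.exp(-lam * M)
    n1, n2 = len(X1), len(X2)
    a, b = np.ones(n1) / n1, np.ones(n2) / n2
    u = a.copy()
    for _ in range(L):            # finite number of Sinkhorn iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    T = u[:, None] * K * v[None, :]
    return (T * M).sum()

def num_grad(f, P, eps=1e-6):
    """Central finite differences, standing in for automatic differentiation."""
    G = np.zeros_like(P)
    for idx in np.ndindex(*P.shape):
        E = np.zeros_like(P)
        E[idx] = eps
        G[idx] = (f(P + E) - f(P - E)) / (2 * eps)
    return G
```

Because the objective is defined through a fixed, finite number of iterations, its gradient is well defined even when Sinkhorn has not fully converged, which is exactly the point of differentiating the iterations rather than the exact optimal plan.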
3.3 Algorithm
In the above subsections, we have reformulated the WDA optimization problem so as to make it tractable. We have derived closed-form expressions for some elements of the gradient, as well as an automatic differentiation strategy for computing gradients of the transport plans \(\mathbf {T}^{c,c^\prime }\) with respect to \(\mathbf {P}\).
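Since \(\mathbf {P}\) is constrained to have orthonormal rows, each gradient step must be followed by a step back onto the Stiefel manifold. The paper relies on a manifold-optimization toolbox for this; a minimal hand-rolled sketch of one common retraction choice (QR-based, with the usual sign fix) looks as follows:

```python
import numpy as np

def stiefel_retract(P):
    """Map a p x d matrix back onto the Stiefel manifold (orthonormal rows)
    via a QR decomposition of its transpose. One common retraction choice;
    a toolbox such as Manopt/Pymanopt would normally handle this."""
    Q, R = np.linalg.qr(P.T)          # P.T is d x p, reduced QR
    s = np.sign(np.diag(R))
    s[s == 0] = 1                     # avoid zeroing columns
    return (Q * s).T

def ascent_step(P, grad, step=1e-2):
    """One retracted gradient-ascent step on the WDA objective."""
    return stiefel_retract(P + step * grad)
```

The retraction guarantees \(\mathbf{P}\mathbf{P}^\top = \mathbf{I}\) after every update, so the iterates stay feasible regardless of the step size.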
4 Discussions
4.1 Wasserstein discriminant analysis: local and global
As we have stated, WDA captures both local and global interactions between the empirical distributions it compares. Globality naturally results from the Wasserstein distance, which is a metric on probability measures and as such measures the discrepancy between distributions as a whole. Note, however, that this property would be shared by any other metric on probability measures. Locality comes as a specific feature of the regularized Wasserstein distance. Indeed, as made clear by the solution in Eq. (4) of the entropy-smoothed optimal transport problem, the weights \(T_{ij}\) tend to be larger for nearby points, with an exponential decrease with respect to the distance between \(\mathbf {P}\mathbf {x}_i\) and \(\mathbf {P}\mathbf {x}_{j'}\).
4.2 Regularized Wasserstein distance and Fisher criterion
The Fisher criterion for measuring separability relies on the ratio of inter-class and intra-class variability of samples. However, this intra-class variability can be challenging to evaluate when information about the probability distributions comes only through empirical samples. Indeed, the classical \((\lambda = \infty )\) Wasserstein distance of a discrete distribution to itself is 0, as is the case for any other metric on empirical distributions. A recent result by Mueller and Jaakkola (2015) also suggests that even splitting the examples of a given class and computing the Wasserstein distance between the resulting empirical distributions will, with high probability, yield an arbitrarily small distance. This is why the entropy-regularized Wasserstein distance plays a key role in our algorithm: to the best of our knowledge, no other metric on empirical distributions would lead to relevant intra-class measures. Indeed, \(W_\lambda (\mathbf {P}\mathbf {X},\mathbf {P}\mathbf {X}) = \langle \mathbf {P}^\top \mathbf {P}, \mathbf {C}\rangle \) with \(\mathbf {C}= \sum _{i,j} T_{i,j} (\mathbf {x}_i - \mathbf {x}_j) (\mathbf {x}_i - \mathbf {x}_j)^\top \). Hence, since \(\lambda < \infty \) ensures that the mass of a given sample is split among its neighbours by the transport map \(\mathbf {T}\), \(W_\lambda (\mathbf {P}\mathbf {X},\mathbf {P}\mathbf {X})\) is non-zero and, interestingly, depends on a weighted covariance matrix \(\mathbf {C}\) which, because it depends on \(\mathbf {T}\), puts more emphasis on pairs of neighbouring examples.
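The fact that the regularized self-distance is strictly positive, and equals the trace of the weighted covariance \(\mathbf {C}\) when no projection is applied, is easy to check numerically. A small sketch (our own illustrative code, with uniform weights and the identity projection):

```python
import numpy as np

def self_distance_and_cov(X, lam=1.0, n_iter=100):
    """Compute W_lambda(X, X) and C = sum_ij T_ij d_ij d_ij^T, where
    d_ij = x_i - x_j, illustrating that the regularized self-distance
    of an empirical distribution is strictly positive."""
    n = len(X)
    M = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-lam * M)
    a = np.ones(n) / n
    u = a.copy()
    for _ in range(n_iter):       # Sinkhorn iterations (symmetric problem)
        v = a / (K.T @ u)
        u = a / (K @ v)
    T = u[:, None] * K * v[None, :]
    d = X[:, None, :] - X[None, :, :]
    C = np.einsum('ij,ijk,ijl->kl', T, d, d)   # weighted covariance
    return (T * M).sum(), C
```

With the identity projection, \(W_\lambda(\mathbf{X},\mathbf{X}) = \sum_{i,j} T_{i,j}\Vert \mathbf{x}_i - \mathbf{x}_j\Vert^2 = \operatorname{Tr}(\mathbf{C})\), which the test below verifies.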
More formally, we can show that minimizing \(W_\lambda (\mathbf {P}\mathbf {X},\mathbf {P}\mathbf {X})\) with respect to \(\mathbf {P}\) induces a neighbourhood-preserving map \(\mathbf {P}\). This means that if an example i is closer to an example j than to an example k in the original space, this relation should be preserved in the projected space, i.e. \(\Vert \mathbf {P}\mathbf {x}_i - \mathbf {P}\mathbf {x}_j\Vert _2\) should be smaller than \(\Vert \mathbf {P}\mathbf {x}_i - \mathbf {P}\mathbf {x}_k\Vert _2\). This neighbourhood preservation is enforced if \(K_{i,j} > K_{i,k}\), which is equivalent to \(M_{i,j} < M_{i,k}\), implies \(T_{i,j} > T_{i,k}\). Indeed, since \(W_\lambda (\mathbf {P}\mathbf {X},\mathbf {P}\mathbf {X}) = \sum _{i,j} T_{i,j} \Vert \mathbf {P}\mathbf {x}_i - \mathbf {P}\mathbf {x}_j\Vert _2^2\), the inequality \(T_{i,j} > T_{i,k}\) means that examples that are close in the input space are encouraged to be close in the projected space. We show next that there exist situations in which this condition is guaranteed.
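One way to see where such a guarantee can come from (a sketch in our notation; the precise constant in the proposition may differ): for the symmetric self-transport problem, the Sinkhorn plan has the scaling form \(T_{i,j} = u_i K_{i,j} u_j\) with \(K_{i,j} = e^{-\lambda M_{i,j}}\), hence

```latex
\frac{T_{i,j}}{T_{i,k}}
= \frac{u_j}{u_k}\,\frac{K_{i,j}}{K_{i,k}},
\qquad\text{so}\qquad
\frac{K_{i,j}}{K_{i,k}} > \frac{u_k}{u_j}
\;\Longrightarrow\;
T_{i,j} > T_{i,k}.
```

A uniform bound on the ratios of the scaling weights \(u\) therefore turns a condition on kernel ratios into the desired neighbourhood-preservation property; this is the role played by the constant A below.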
Proposition 1
Proof
Note that this proposition provides us with a guarantee on the ratio \(\frac{K_{i,k}}{K_{i,j}}\) between examples that induces the preservation of neighbourhoods. However, the constant A we exhibit here is probably loose, and thus a larger ratio may still preserve locality.
4.3 Connection to Fisher discriminant analysis
Following this connection, we want to stress again the role played by the within-class Wasserstein distances in WDA. First, from a theoretical point of view, optimizing the ratio instead of just maximizing the between-class distance allows us to encompass well-known methods such as FDA. Second, as we have shown in the previous subsection, minimizing the within-class distance provides interesting features such as neighbourhood preservation under mild conditions.
Another intuitive benefit of minimizing the within-class distance is the following. Suppose we have several projection maps that lead to the same optimal transport matrix \(\mathbf {T}\). Since \(W_\lambda (\mathbf {P}\mathbf {X},\mathbf {P}\mathbf {X}) = \sum _{i,j} T_{i,j} \Vert \mathbf {P}\mathbf {x}_i - \mathbf {P}\mathbf {x}_j\Vert _2^2\) for any \(\mathbf {P}\), minimizing \(W_\lambda (\mathbf {P}\mathbf {X},\mathbf {P}\mathbf {X})\) with respect to \(\mathbf {P}\) means preferring the projection map that yields the smallest weighted (according to \(\mathbf {T}\)) pairwise distance between samples in the projected space. Since, for an example i, the \(\{T_{i,j}\}_j\) are mainly non-zero among the neighbours of i, minimizing the within-class distance favours projection maps that tightly cluster points of the same class.
4.4 Relation to other information-theoretic discriminant analysis
Several information-theoretic criteria have been considered for discriminant analysis and dimensionality reduction. Compared to Fisher's criterion, they have the advantage of going beyond a simple sketch of the data pdf based on second-order statistics. Two recent approaches are based on the idea of maximizing the distance between the probability distributions of the data in the projection subspace. They differ only in the choice of the metric on pdfs, one being an \(L_2\) distance (Emigh et al. 2015) and the other a Wasserstein distance (Mueller and Jaakkola 2015). While our approach also seeks a projection that maximizes a distance between pdfs, it has the unique feature of finding projections that preserve neighbourhoods. Other recent approaches have addressed supervised dimensionality reduction from an information-theoretic learning perspective without directly maximizing a distance between pdfs in the projected subspace. We discuss two methods to which we compare in the experimental analysis. The approach of Suzuki and Sugiyama (2013), denoted LSDR, seeks a low-rank subspace of the inputs that contains sufficient information for predicting the output values. The authors define the notion of sufficiency through conditional independence of the outputs and the inputs given the projected inputs, and evaluate this measure through squared-loss mutual information. One major drawback of their approach is that they need to estimate a density ratio, thus introducing an extra layer of complexity and an error-prone task. A similar idea has been investigated by Tangkaratt et al. (2015), who use quadratic mutual information to evaluate the statistical dependence between projected inputs and outputs (the method is named LSQMI). While they avoid the estimation of a density ratio, they still need to estimate derivatives of the quadratic mutual information.
Like our approach, the method of Giraldo and Principe (2013) avoids density estimation for supervised metric learning. Indeed, the key aspect of their work is to show that the Gram matrix of data samples can be related to information-theoretic quantities such as conditional entropy without the need to estimate pdfs. Based on this finding, they introduced a metric learning approach, coined CEML, that minimizes the conditional entropy between labels and projected samples. While their approach is appealing, we believe that a direct criterion such as Fisher's is more relevant and robust for classification purposes, as shown in our experiments.
4.5 Wasserstein distances and machine learning
Wasserstein distances derive from the theory of optimal transport (Villani 2008) and provide a useful way to compare probability measures. Their practical deployment in machine learning problems has been facilitated by regularized versions of the original problem (Cuturi 2013; Benamou et al. 2015). The geometry of the space of probability measures endowed with the Wasserstein metric makes it possible to consider various objects of interest such as means or barycenters (Cuturi and Doucet 2014; Benamou et al. 2015), and has led to a generalization of PCA to the space of probability measures (Seguy and Cuturi 2015). Wasserstein distances have also been considered for semi-supervised learning (Solomon et al. 2014), domain adaptation (Courty et al. 2016), and the definition of loss functions (Frogner et al. 2015). More recently, they have been used in a subspace identification problem for analyzing the differences between distributions (Mueller and Jaakkola 2015); contrary to our approach, however, that work only considers projections onto univariate distributions, and as such cannot find subspaces of dimension \(>1\). Huang et al. (2016) proposed to use the Wasserstein distance to measure similarity between documents and to learn a metric that encodes class information between samples. Note that in our work we use the Wasserstein distance between empirical distributions, and not between training samples, yielding a very different approach.
5 Numerical experiments
In this section we illustrate how WDA works on several learning problems. First, we evaluate our approach on a simple simulated dataset with a 2-dimensional discriminative subspace. Then, we benchmark WDA on the MNIST and Caltech datasets with pre-defined hyperparameter settings for the methods that have some. Unless otherwise specified and justified, for LFDA and LMNN we set the number of neighbours to 5. For CEML, the Gaussian kernel width \(\sigma \) is fixed to \(\sqrt{p}\), which is the value used by Giraldo and Principe (2013) across all their experiments. For WDA, we chose \(\lambda = 0.01\) except for the toy problem. The final experiment compares the performance of WDA and its competitors on several UCI datasets, for which the relevant parameters have been validated.
Note that, in the spirit of reproducible research, the code is made available to the community: the Python implementation of WDA is part of POT, the Python Optimal Transport toolbox (Flamary and Courty 2017), on GitHub.^{1}
5.1 Practical implementation
In order to make the method less sensitive to the dimension and scaling of the data, we propose to use a pre-computed adaptive regularization parameter for each Wasserstein distance in (1). Denote by \(\lambda _{c,c'}\) such a parameter, yielding a distance \(W_{\lambda _{c,c'}}\) between classes c and \(c'\). In practice, we initialize \(\mathbf {P}\) with the PCA projection and define \(\lambda _{c,c'}=\lambda (\frac{1}{n_cn_{c'}}\sum _{i,j}\Vert \mathbf {P}\mathbf {x}_i^c-\mathbf {P}\mathbf {x}_j^{c'}\Vert ^2)^{-1}\). These values are computed a priori and kept fixed in the remaining iterations. They have the advantage of promoting a similar regularization strength between inter- and intra-class distances.
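The adaptive parameters \(\lambda _{c,c'}\) just defined are straightforward to pre-compute. A small sketch (our own illustrative function, taking per-class sample matrices and the initial PCA projection):

```python
import numpy as np

def adaptive_reg(X_by_class, P, lam=0.01):
    """Pre-compute lam_{c,c'} = lam / mean_{i,j} ||P x_i^c - P x_j^{c'}||^2
    for every pair of classes, using the initial projection P (e.g. PCA).
    Illustrative sketch of the heuristic described above."""
    lams = {}
    classes = list(X_by_class)
    for c in classes:
        for cp in classes:
            Yc, Ycp = X_by_class[c] @ P.T, X_by_class[cp] @ P.T
            D2 = ((Yc[:, None, :] - Ycp[None, :, :]) ** 2).sum(-1)
            lams[(c, cp)] = lam / D2.mean()
    return lams
```

Because the mean squared distance is symmetric in the two classes, \(\lambda _{c,c'} = \lambda _{c',c}\), and the same global \(\lambda \) yields a comparable effective regularization for inter- and intra-class terms.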
5.2 Simulated dataset
Figure 2 also illustrates the projection of test samples onto the two-dimensional subspaces obtained by the different approaches. We can see that for this dataset WDA, LSDR and CEML lead to a good discriminant subspace. This illustrates the importance of estimating relations between samples in the projected space, as opposed to the original space as done in LMNN and LFDA. Quantitative results are given in Fig. 3 (left), where we report the prediction error of a K-nearest-neighbors classifier (KNN) for \(n=100\) training examples and \(n_t=5000\) test examples. In this simulation, all prediction errors are averaged over 20 data generations, and the neighbour parameters of LMNN and LFDA were selected empirically to maximize performance (respectively 5 for LMNN and 1 for LFDA). We can see in the left part of the figure that WDA, LSDR and CEML, and to a lesser extent LMNN, can estimate the relevant subspace when its optimal dimension is given to them, in a way that is robust to the choice of K. Note that slightly better performance is achieved by LSDR and CEML. In the right plot of Fig. 3, we show the performance of all algorithms when varying the dimension of the projected space. We note that WDA, LMNN, LSDR and LFDA achieve their best performance for \(p=2\) and that prediction errors rapidly increase when p is misspecified. In contrast, CEML performs very well for \(p \ge 2\). Being sensitive to the correct projected-space dimensionality can be considered an asset, as this dimension is typically optimized (e.g. by cross-validation), making it easier to spot the best dimension reduction. Conversely, CEML is robust to a misspecified projected-space dimensionality at the expense of under-estimating the best reduction of dimension.
5.3 MNIST dataset
Our objective with this experiment is to measure how robust our approach is with only a few training samples, despite the high dimensionality of the problem. To this end, we draw \(n=1000\) samples for training and report the KNN prediction error as a function of k for the different subspace methods when projecting onto \(p=10\) and \(p=20\) dimensions (resp. left and right plots of Fig. 5). The reported scores are averages over 20 realizations of the same experiment. We also limit the number of Sinkhorn fixed-point iterations to \(L=10\) and set \(\lambda =0.01\). For both values of p, WDA finds a better subspace than the original space, which suggests that most of the discriminant information available in the training dataset has been correctly extracted. Conversely, the other approaches struggle to find a relevant subspace in this configuration. In addition to the better prediction performance, we want to emphasize that in this configuration WDA leads to a dramatic compression of the data, from 784 to 10 or 20 features, while preserving most of the discriminative information.
To gain a better understanding of the corresponding embedding, we further projected the data from the 10-dimensional space onto a 2-dimensional one using t-SNE (Van der Maaten and Hinton 2008). In order to make the embeddings comparable, we used the same t-SNE initialization for all methods. The resulting 2D projections of the test samples are shown in Fig. 6. We can clearly see the overfitting behaviour of FDA, LFDA, LMNN and LSDR, which accurately separate the training samples but fail to separate the test samples. In contrast, WDA is able to disentangle classes in the training set while preserving generalization abilities.
5.4 Caltech dataset
In this experiment, we use a subset, described by Donahue et al. (2014), of the Caltech-256 image collection (Griffin et al. 2007). The dataset uses features that are the output of the DeCAF deep learning architecture (Donahue et al. 2014). More precisely, they are extracted as the sparse activations of the neurons of the 6th fully connected layer of a convolutional network trained on ImageNet and then fine-tuned for the considered visual recognition task. As such, they form vectors of 4096 dimensions, and we look for subspaces of dimension as small as 15. In this setting, 500 images are considered for training, and the remaining portion of the dataset for testing (623 images). There are 9 different classes in this dataset. We examine in this experiment how the proposed dimensionality reduction performs when changing the subspace dimensionality. For this problem, the regularization parameter \(\lambda \) of WDA was empirically set to \(10^{-2}\). The K in KNN was set to 3, a common standard setting for this classifier. The results reported in Fig. 7 are averaged over 10 realizations of the same experiment. When \(p\ge 5\), WDA already finds a subspace which gathers relevant discriminative information from the original space. In this experiment, LMNN yields a better subspace for small values of p, while WDA is the best performing method for \(p\ge 6\). These results highlight the potential interest of using WDA as a linear dimensionality reduction layer in neural network architectures.
5.5 Running-time
For the above experiments on MNIST and Caltech, we also evaluated the running times of the compared algorithms. The LFDA, LMNN, LSDR and CEML codes are the Matlab codes released by their authors. Our WDA code is Python-based and relies on the POT toolbox. All these codes were run on a 16-core Intel Xeon E5-2630 CPU operating at 2.4 GHz, with GNU/Linux and 144 GB of RAM.
Table 1 Averaged running times in seconds (standard deviations in parentheses) of the different algorithms for computing the learned subspaces
Datasets | PCA | FDA | LFDA | LMNN | LSDR | CEML | WDA |
---|---|---|---|---|---|---|---|
Mnist (10) | 0.39(0.1) | 0.69(0.2) | 0.55(0.4) | 20.55(14.2) | 29813(5048) | 87.02(8.7) | 6.28(0.3) |
Mnist (20) | 0.38(0.0) | 0.58(0.0) | 0.54(0.2) | 18.27(17.0) | 60147(11176) | 90.22(8.8) | 6.15(0.1) |
Caltech (14) | 0.53(0.3) | 21.38(6.1) | 11.43(2.0) | 39.56(6.3) | 140776(53036) | 14.59(7.6) | 5.29(0.1) |
5.6 UCI datasets
Table 2 Average test errors (in %) over 20 trials on UCI datasets
Datasets | Orig. | PCA | FDA | LFDA | LMNN | LSDR | LSQMI | CEML | WDA |
---|---|---|---|---|---|---|---|---|---|
Wines | 24.33 | 26.57 | 37.87 | 29.21 | 32.81 | 32.81 | 46.29 | 15.34 | 16.91 |
Iris | 42.07 | 40.60 | 19.27 | 25.13 | 21.67 | 37.93 | 56.27 | 20.87 | 20.87 |
Glass | 54.01 | 58.16 | 57.45 | 59.53 | 54.25 | 50.85 | 65.42 | 34.86 | 45.99 |
Vehicles | 58.68 | 57.26 | 48.57 | 48.25 | 40.84 | 51.86 | 65.09 | 48.46 | 51.13 |
Credit | 28.90 | 25.57 | 18.67 | 17.69 | 23.73 | 24.71 | 39.01 | 17.65 | 17.39 |
Ionosphere | 26.14 | 26.90 | 29.63 | 27.64 | 30.80 | 31.08 | 36.42 | 22.87 | 20.40 |
Isolet | 17.50 | 17.60 | 15.12 | 13.96 | 11.13 | 13.33 | 21.76 | 30.19 | 14.41 |
Usps | 7.59 | 7.66 | 11.63 | 12.76 | 6.05 | 8.77 | 14.83 | 10.15 | 6.50 |
Mnist | 17.26 | 14.16 | 33.85 | 29.92 | 13.95 | 26.53 | 60.05 | 24.68 | 13.07 |
Caltechpca | 23.39 | 13.93 | 12.03 | 18.19 | 11.55 | 36.08 | 100.00 | 13.65 | 11.45 |
Aver. Rank | 5.4 | 5.5 | 5.2 | 5.2 | 3.4 | 5.7 | 8.9 | 3.5 | 2.2 |
We also compared the performance of the dimensionality reduction algorithms on several UCI benchmark datasets (Lichman 2013). The experimental setting is similar to the one proposed by the authors of LSQMI (Tangkaratt et al. 2015). For these UCI datasets, we appended the original input features with 100 noise features. We split the examples 50–50% into a training and a test set. Hyper-parameters such as the number of neighbours for the KNN and the dimensionality of the projection were cross-validated on the training set, chosen respectively among the values [1 : 2 : 19] (in Matlab notation) and [5, 10, 15, 20, 25]. Splits were performed 20 times. Note that we also added experiments with the Isolet, USPS, MNIST and Caltech datasets under this validation setting, but without the additional noisy features. Table 2 presents the performance of the competing methods. We note that our WDA is more robust than all other methods and is able to capture relevant information in the learned subspaces. Its average rank across all datasets is 2.2, while that of the second best, LMNN, is 3.4. There is only one dataset (Vehicles) for which WDA performs significantly worse than the top methods. Interestingly, LSDR and LSQMI seem to be less robust than LMNN and FDA, against which they were not compared in the original paper (Tangkaratt et al. 2015).
6 Conclusion
This work presented Wasserstein Discriminant Analysis, a new and original linear discriminant subspace estimation method. Based on the framework of regularized Wasserstein distances, which measure a global similarity between empirical distributions, WDA operates by separating the distributions of different classes in the subspace while maintaining a coherent structure at the class level. To this end, the use of regularization in the Wasserstein formulation effectively bridges the gap between global coherence and the local structure of the class manifold. This comes at the cost of a difficult bi-level optimization problem, for which we proposed an efficient method based on automatic differentiation of the Sinkhorn algorithm. Numerical experiments show that the method performs well on a variety of features, including those obtained with a deep neural architecture. Future work will consider stochastic versions of the same approach in order to further enhance the ability of the method to handle large volumes of high-dimensional data.
References
- Absil, P. A., Mahony, R., & Sepulchre, R. (2009). Optimization algorithms on matrix manifolds. Princeton: Princeton University Press.
- Bach, F. R., Lanckriet, G. R., & Jordan, M. I. (2004). Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the twenty-first international conference on machine learning (p. 6). ACM.
- Benamou, J. D., Carlier, G., Cuturi, M., Nenna, L., & Peyré, G. (2015). Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2), A1111–A1138.
- Bengio, Y. (2000). Gradient-based optimization of hyperparameters. Neural Computation, 12(8), 1889–1900.
- Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1–127.
- Bonnans, J. F., & Shapiro, A. (1998). Optimization problems with perturbations: A guided tour. SIAM Review, 40(2), 228–264.
- Bonneel, N., Peyré, G., & Cuturi, M. (2016). Wasserstein barycentric coordinates: Histogram regression using optimal transport. ACM Transactions on Graphics, 35(4), 71:1–71:10.
- Boumal, N., Mishra, B., Absil, P. A., & Sepulchre, R. (2014). Manopt, a Matlab toolbox for optimization on manifolds. The Journal of Machine Learning Research, 15(1), 1455–1459.
- Burges, C. J. (2010). Dimension reduction: A guided tour. Boston: Now Publishers.
- Chapelle, O., Vapnik, V., Bousquet, O., & Mukherjee, S. (2002). Choosing multiple parameters for support vector machines. Machine Learning, 46(1–3), 131–159.
- Colson, B., Marcotte, P., & Savard, G. (2007). An overview of bilevel optimization. Annals of Operations Research, 153(1), 235–256.
- Courty, N., Flamary, R., Tuia, D., & Rakotomamonjy, A. (2016). Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Cuturi, M. (2013). Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS (pp. 2292–2300).
- Cuturi, M., & Doucet, A. (2014). Fast computation of Wasserstein barycenters. In ICML.
- Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., et al. (2014). DeCAF: A deep convolutional activation feature for generic visual recognition. In Proceedings of the 31st international conference on machine learning (pp. 647–655).
- Emigh, M., Kriminger, E., & Principe, J. C. (2015). Linear discriminant analysis with an information divergence criterion. In 2015 International joint conference on neural networks (IJCNN) (pp. 1–6). IEEE.
- Fern, X. Z., & Brodley, C. E. (2003). Random projection for high dimensional data clustering: A cluster ensemble approach. In ICML (Vol. 3, pp. 186–193).
- Flamary, R., & Courty, N. (2017). POT: Python Optimal Transport library.
- Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning. Springer series in statistics. Berlin: Springer.
- Frogner, C., Zhang, C., Mobahi, H., Araya, M., & Poggio, T. (2015). Learning with a Wasserstein loss. In NIPS (pp. 2044–2052).
- Giraldo, L. G. S., & Principe, J. C. (2013). Information theoretic learning with infinitely divisible kernels. In Proceedings of the first international conference on learning representations (ICLR) (pp. 1–8).
- Griffin, G., Holub, A., & Perona, P. (2007). Caltech-256 object category dataset. Technical report CNS-TR-2007-001, California Institute of Technology.
- Huang, G., Guo, C., Kusner, M. J., Sun, Y., Sha, F., & Weinberger, K. Q. (2016). Supervised word mover's distance. In Advances in neural information processing systems (pp. 4862–4870).
- Knight, P. A. (2008). The Sinkhorn–Knopp algorithm: Convergence and applications. SIAM Journal on Matrix Analysis and Applications, 30(1), 261–275.
- Koep, N., & Weichwald, S. (2016). Pymanopt: A Python toolbox for optimization on manifolds using automatic differentiation. Journal of Machine Learning Research, 17, 1–5.
- Lichman, M. (2013). UCI machine learning repository. http://archive.ics.uci.edu/ml.
- Mueller, J., & Jaakkola, T. (2015). Principal differences analysis: Interpretable characterization of differences between distributions. In NIPS (pp. 1693–1701).
- Petersen, K. B., Pedersen, M. S., et al. (2008). The matrix cookbook. Technical University of Denmark, 7, 15.
- Peyré, G., & Cuturi, M. (2018). Computational optimal transport. Foundations and Trends in Computer Science (to be published). https://optimaltransport.github.io.
- Schmidt, M. (2008). minConf: Projection methods for optimization with simple constraints in Matlab.
- Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge: MIT Press.
- Seguy, V., & Cuturi, M. (2015). Principal geodesic analysis for probability measures under the optimal transport metric. In NIPS (pp. 3294–3302).
- Solomon, J., Rustamov, R., Leonidas, G., & Butscher, A. (2014). Wasserstein propagation for semi-supervised learning. In ICML (pp. 306–314).
- Sugiyama, M. (2007). Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. The Journal of Machine Learning Research, 8, 1027–1061.
- Suzuki, T., & Sugiyama, M. (2013). Sufficient dimension reduction via squared-loss mutual information estimation. Neural Computation, 25(3), 725–758.
- Tangkaratt, V., Sasaki, H., & Sugiyama, M. (2015). Direct estimation of the derivative of quadratic mutual information with application in supervised dimension reduction. arXiv preprint arXiv:1508.01019.
- Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.
- Van Der Maaten, L., Postma, E., & Van den Herik, J. (2009). Dimensionality reduction: A comparative review. Journal of Machine Learning Research, 10, 66–71.
- Villani, C. (2008). Optimal transport: Old and new (Vol. 338). Berlin: Springer.
- Weinberger, K. Q., & Saul, L. K. (2009). Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research, 10, 207–244.
- Xing, E. P., Ng, A. Y., Jordan, M. I., & Russell, S. (2003). Distance metric learning with application to clustering with side-information. Advances in Neural Information Processing Systems, 15, 505–512.