Abstract
We develop a class of hierarchically low-rank, scalable optimal transport dissimilarity measures for structured data, bringing the current state-of-the-art optimal transport solvers to a higher level of performance. Given two n-dimensional discrete probability measures supported on two structured grids in \({\mathbb {R}}^d\), we present a fast method for computing an entropically regularized optimal transport distance, referred to as the debiased Sinkhorn distance. The method combines Sinkhorn’s matrix scaling iteration with a low-rank hierarchical representation of the scaling matrices to achieve a near-linear complexity \({{\mathscr {O}}}(n \ln ^4 n)\). This provides a fast, scalable, and easy-to-implement algorithm for computing a class of optimal transport dissimilarity measures, enabling their applicability to large-scale optimization problems, where the computation of the classical Wasserstein metric is not feasible. We carry out a rigorous error-complexity analysis for the proposed algorithm and present several numerical examples to verify the accuracy and efficiency of the algorithm and to demonstrate its applicability in tackling real-world problems.
Notes
For example https://github.com/ecrc/hicma.
Ethics declarations
Conflict of interest
Not Applicable.
Additional information
Communicated by Elias Jarlebring.
Appendix
Proof of Lemma 3.1
We first show that for real \(n_k \times n_k\) matrices \(C^{(k)} = [C_{ij}^{(k)}]\), with \(k=1, \dotsc , d\), we have
where \(\exp [\, \cdot \, ]\) denotes element-wise exponentiation, and \(\oplus \) denotes the “all-ones Kronecker sum” operator in Definition 3.1. The case \(d=1\) is trivial. Consider the case \(d=2\). By the definition of the all-ones Kronecker sum, we have
Now assume that it is true for \(d-1\):
Then
The proof of Lemma 3.1 simply follows from (6.1) and (3.1). \(\square \)
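For \(d=2\), the identity (6.1) can be checked numerically. The sketch below assumes the all-ones Kronecker sum reads \(C^{(1)} \oplus C^{(2)} = \mathbf{1}_{n_2} \otimes C^{(1)} + C^{(2)} \otimes \mathbf{1}_{n_1}\), with \(\mathbf{1}_m\) the all-ones \(m \times m\) matrix (our reading of Definition 3.1, which is not reproduced here); all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 3, 4
C1 = rng.random((n1, n1))  # cost matrix along dimension 1
C2 = rng.random((n2, n2))  # cost matrix along dimension 2

# All-ones Kronecker sum (assumed form): entry ((i2,i1),(j2,j1)) of
# C_sum equals C1[i1,j1] + C2[i2,j2].
J1 = np.ones((n1, n1))
J2 = np.ones((n2, n2))
C_sum = np.kron(J2, C1) + np.kron(C2, J1)

# Element-wise exponentiation turns the sum into a Kronecker product
# of the one-dimensional kernels, as in (6.1).
lhs = np.exp(-C_sum)
rhs = np.kron(np.exp(-C2), np.exp(-C1))
assert np.allclose(lhs, rhs)
```

The assertion passes because the element-wise exponential of a sum of costs factorizes entry by entry into a product, which is exactly the Kronecker product structure.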
Proof of Lemma 3.2
We utilize the vectorization operator \(\text {vec}(\cdot )\) that converts \(m\times n\) matrices into \(mn \times 1\) vectors, and denote by \(\mathbf{v} =\text {vec}(V)\) the vector obtained by stacking the columns of matrix V beneath one another. We proceed by induction on the dimension d, and note that by Lemma 3.1 the kernel matrix Q is of the block form (3.3). The case \(d=1\) is trivial, noting that \({\hat{n}}_1 = 1\). Let \(d=2\) and \(Q = Q_2 \otimes Q_1 \in {{\mathbb {R}}}_+^{n \times n}\), with \(Q_1 \in {\mathbb R}_+^{n_1 \times n_1}\), \(Q_2 \in {{\mathbb {R}}}_+^{n_2 \times n_2}\), and \(n = n_1 n_2\). Let further \(V = \text {vec}^{-1}(\mathbf{v}) \in {{\mathbb {R}}}^{n_1 \times n_2}\) be the matrix whose vectorization gives the vector \(\mathbf{v} =\text {vec}(V) \in {{\mathbb {R}}}^{n_1 n_2 \times 1 }\). Then, we have
and computing \(Q_1 V Q_2^{\top }\) can be carried out in two steps. The first step \(W:= V Q_2^{\top }\) requires \(n_1\) multiplications \(Q_2 \tilde{\mathbf{v}}_i\), \(i=1, \dotsc , n_1\), where \(\tilde{\mathbf{v}}_i\) is the i-th row of V in column-vector form. The second step \(Q_1 W\) requires \(n_2\) multiplications \(Q_1 \mathbf{w}_j\), \(j=1, \dotsc , n_2\), where \(\mathbf{w}_j\) is the j-th column of W. Overall, we need \(n_1\) matrix-vector multiplications \(Q_2 \mathbf{z}\) for \(n_1\) different vectors \(\mathbf{z}\), and \(n_2\) matrix-vector multiplications \(Q_1 \mathbf{z}\) for \(n_2\) different vectors \(\mathbf{z}\), as desired. This can be recursively extended to higher dimensions (\(d \ge 3\)) using the associativity of the Kronecker product, \(A \otimes B \otimes C = A \otimes (B \otimes C)\). Setting \({\hat{Q}}_d:=Q_{d-1} \otimes \cdots \otimes Q_1\), we can write
This can again be done in two steps. We first compute \(W:=V Q_d^{\top } \in {{\mathbb {R}}}^{{\hat{n}}_d \times n_d}\), which requires \({\hat{n}}_d\) multiplications \(Q_d \tilde{\mathbf{v}}_i\), where \(\tilde{\mathbf{v}}_i \in {{\mathbb {R}}}^{n_d \times 1}\) is the i-th row of matrix V in the form of a column vector, with \(i=1, \dotsc , {\hat{n}}_d\). We then compute \({\hat{Q}}_d W\), which requires \(n_d\) matrix-vector multiplications \({\hat{Q}}_d \mathbf{w}_j\), where \(\mathbf{w}_j \in {{\mathbb {R}}}^{{\hat{n}}_d \times 1}\) is the j-th column of matrix W, with \(j=1, \dotsc , n_d\). Noting that each multiplication \({\hat{Q}}_d \mathbf{w}_j\) is of the same form as (6.2) but in \(d-1\) dimensions, the lemma follows by induction. \(\square \)
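The recursive two-step strategy above can be sketched as a reshape-and-multiply loop, assuming column-major (Fortran-order) vectorization; the function name `kron_matvec` and the test sizes are illustrative only:

```python
import numpy as np

def kron_matvec(Q_list, v):
    """Compute (Q_d x ... x Q_1) v, with x the Kronecker product,
    without forming the Kronecker product: apply each factor Q_k as a
    mode-k tensor contraction (the two-step strategy of Lemma 3.2,
    unrolled over all d dimensions)."""
    dims = [Q.shape[0] for Q in Q_list]          # (n_1, ..., n_d)
    # Fortran order matches column-major vectorization vec(V).
    T = v.reshape(dims, order="F")
    for k, Q in enumerate(Q_list):
        T = np.tensordot(Q, T, axes=([1], [k]))  # contract mode k
        T = np.moveaxis(T, 0, k)                 # restore axis order
    return T.reshape(-1, order="F")

rng = np.random.default_rng(1)
Qs = [rng.random((n, n)) for n in (2, 3, 4)]     # Q_1, Q_2, Q_3
v = rng.random(2 * 3 * 4)

dense = np.kron(np.kron(Qs[2], Qs[1]), Qs[0])    # Q_3 x Q_2 x Q_1
assert np.allclose(kron_matvec(Qs, v), dense @ v)
```

Each factor \(Q_k\) touches \(n/n_k\) vectors of length \(n_k\), so the total cost is a sum of small matrix-vector products rather than one \(n \times n\) product.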
Proof of Lemma 4.1
The proof follows by induction. For the case \(d=1\), we simply note that
We next consider \(d \ge 2\) and assume that the lemma holds for the case \(d-1\),
where \({\hat{Q}}_d:= Q_{d-1} \otimes \cdots \otimes Q_1\) and \(\mathbf{z} \in {{\mathbb {R}}}^{{\hat{n}}_d}\), with \({\hat{n}}_d = n/n_d\). We will show the lemma also holds for the case d. We first note that
Following the same notation used in the proof of Lemma 3.2 and in Function 1, we denote by \(\tilde{\mathbf{v}}_i\) and \(\tilde{\mathbf{w}}_i\) the i-th rows of matrices V and \(W = V \, Q_d^{\top }\) in column-vector forms, respectively, and by \(\mathbf{v}_j\) and \(\mathbf{w}_j\) the j-th columns of those two matrices. We then follow the two-step strategy in Function 1 for computing \(Q \mathbf{v}\). The first step is to approximate \(\tilde{\mathbf{w}}_i := Q_d \, \tilde{\mathbf{v}}_i\) by \(\tilde{\mathbf{w}}_i^{\text {H}} := Q_d^{\text {H}} \, \tilde{\mathbf{v}}_i\). This gives us
and
The second step is to approximate \(\mathbf{s}_j := {\hat{Q}}_d \, \mathbf{w}_j\) by \(\mathbf{s}_j^{\text {H}} := {\hat{Q}}_d^{\text {H}} \, \mathbf{w}_j^{\text {H}}\). This gives us, by triangle inequality,
The first term on the right-hand side of the above inequality can be bounded using the compatibility and submultiplicativity of the Frobenius norm, and the second term can be bounded by the inductive hypothesis (6.3), as follows
In order to use the estimates (6.4)–(6.5) in the above inequality, we first note that if for three nonnegative numbers (a, b, c) we have \(a \le b+c\), then \(a^2 \le 2 b^2 + 2 c^2\). We hence write
We then utilize the connection between the \(L^2\)-norm of rows \(\tilde{\mathbf{w}}_i\) and columns \(\mathbf{w}_j\) of a matrix
and, thanks to (6.4)–(6.5), write
Setting \(\mathbf{s}:= Q \mathbf{v}\) and \(\mathbf{s}^{\text {H}}:= Q^{\text {H}} \mathbf{v}\), and noting that \(|| \mathbf{s} - \mathbf{s}^{\text {H}} ||_2^2 = \sum _j || \mathbf{s}_j - \mathbf{s}_j^{\text {H}} ||_2^2\) and that \(|| \mathbf{v} ||_2^2 = \sum _i || \tilde{\mathbf{v}}_i ||_2^2\), we obtain
It remains to note that the solution of the recursive equation \(h(d)^2 = 2 (1 + h(d-1)^2)\) with \(h(1)=1\) is given by \(h(d) = (2^d + 2^{d-1}-2)^{1/2}\), as desired. \(\square \)
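The closed form for \(h(d)\) can be verified against the recursion directly; the short check below is purely illustrative:

```python
import math

# Recursion h(d)^2 = 2 (1 + h(d-1)^2) with h(1) = 1, against the
# claimed closed form h(d) = (2^d + 2^(d-1) - 2)^(1/2).
h = 1.0  # h(1)
for d in range(2, 11):
    h = math.sqrt(2.0 * (1.0 + h * h))
    closed = math.sqrt(2 ** d + 2 ** (d - 1) - 2)
    assert math.isclose(h, closed)
```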
Proof of Lemma 4.2
By (2.13) and (2.12), we first write
where in the second inequality, we have assumed without loss of generality that the approximation error in the entropic optimal cost of \((\mathbf{f}_n, \mathbf{g}_n)\) is larger than or equal to that of \((\mathbf{f}_n, \mathbf{f}_n)\) and \((\mathbf{g}_n, \mathbf{g}_n)\). By the Cauchy–Schwarz inequality \(|\mathbf{v}^{\top } \mathbf{w}| \le || \mathbf{v} ||_2 \, || \mathbf{w} ||_2\), we then get
We next find an upper bound for \(|| \ln ( \mathbf{u}^{(K)} \oslash \mathbf{u}^{H (K)}) ||_2\). By the first Sinkhorn formula (2.10), we have
Then, by the logarithm inequality \(\ln (1+\beta ) \le \beta \), for \(\beta > -1\), we get
with \(\omega >0\) as in the statement of the lemma. Hence,
where the second inequality is the triangle inequality, and the third inequality follows from Lemma 4.1 and (4.5), together with the compatibility of the Frobenius matrix norm with the Euclidean vector norm, \(||A \mathbf{v} ||_2 \le ||A||_{\text {F}} ||\mathbf{v}||_2\). To further bound the second term of the above inequality, we use the second Sinkhorn formula (2.10) and write
Hence, noting that \(\max \mathbf{g}_n \le 1\), we obtain
By the triangle inequality and Lemma 4.1 and (4.5), we then get
Using the logarithm inequality \(1-1/\beta \le \ln \beta \) for \(\beta >0\), and setting \(\beta \) to be the elements of the vector \(\mathbf{u}^{(i-1)} \oslash \mathbf{u}^{H (i-1)}\), we have
where we have used \(\max \mathbf{u}^{(i-1)} \le 1 / \omega \), which follows by (2.10) and since \(\max \mathbf{f}_n \le 1\). Noting that \(|| \mathbf{u}^{(i-1)} ||_2 \le n / \omega \), the last two inequalities above therefore give us
Now, setting
by (6.7) and (6.8), and noting that \(|| \mathbf{v}^{(i-1)} ||_2 \le n / \omega \), we obtain the recursive formula
where \(c_1\) is a constant that depends only on \(\omega \) and d. Noting that \(\delta _\mathbf{u}^{(0)} = 0\), we finally obtain
where \(c_2\) is a constant that depends only on \(c_1\). Finally, by the sum of a finite geometric series, we get
where \(c_3\) is a constant that depends only on \(c_2\), that is, on \(\omega \) and d. The desired estimate (4.6) follows by (6.6) and (6.9), and noting that a similar estimate as (6.9) holds for \(\delta _\mathbf{v}^{(K)} := || \ln ( \mathbf{v}^{(K)} \oslash \mathbf{v}^{H (K)}) ||_2\). \(\square \)
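The proof above repeatedly invokes the two Sinkhorn scaling formulas (2.10). As a self-contained illustration, here is the standard Sinkhorn matrix-scaling iteration for a generic positive kernel; this is a sketch under the assumption that (2.10) has the usual form \(\mathbf{u} = \mathbf{f} \oslash (Q\mathbf{v})\), \(\mathbf{v} = \mathbf{g} \oslash (Q^{\top }\mathbf{u})\), and all names and sizes below are illustrative:

```python
import numpy as np

def sinkhorn_scalings(Q, f, g, K):
    """Standard Sinkhorn matrix-scaling iteration for a positive
    kernel Q and marginals (f, g); returns the scaling vectors
    (u, v) after K sweeps. (Assumed form of the paper's (2.10).)"""
    u = np.ones(len(f))
    v = np.ones(len(g))
    for _ in range(K):
        u = f / (Q @ v)       # first scaling formula
        v = g / (Q.T @ u)     # second scaling formula
    return u, v

# Tiny example: uniform marginals and a Gaussian kernel on [0, 1].
n = 8
x = np.linspace(0.0, 1.0, n)
Q = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.5)
f = np.full(n, 1.0 / n)
g = np.full(n, 1.0 / n)
u, v = sinkhorn_scalings(Q, f, g, K=200)

# The scaled plan P = diag(u) Q diag(v) matches the target marginals.
P = u[:, None] * Q * v[None, :]
assert np.allclose(P.sum(axis=1), f, atol=1e-8)
assert np.allclose(P.sum(axis=0), g, atol=1e-8)
```

In the hierarchical algorithm analyzed above, the exact products \(Q\mathbf{v}\) and \(Q^{\top }\mathbf{u}\) are replaced by their low-rank approximations, which is precisely the perturbation the lemma quantifies.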
About this article
Cite this article
Motamed, M. A hierarchically low-rank optimal transport dissimilarity measure for structured data. BIT Numer. Math. 62, 1945–1982 (2022). https://doi.org/10.1007/s10543-022-00937-9
Keywords
- Optimal transport
- Wasserstein metric
- Sinkhorn divergence
- Entropic regularization
- Hierarchical matrices
- Optimization problems