Abstract
We develop a class of hierarchically low-rank, scalable optimal transport dissimilarity measures for structured data, bringing the current state-of-the-art optimal transport solvers to a higher level of performance. Given two n-dimensional discrete probability measures supported on two structured grids in \({\mathbb {R}}^d\), we present a fast method for computing an entropically regularized optimal transport distance, referred to as the debiased Sinkhorn distance. The method combines Sinkhorn’s matrix scaling iteration with a low-rank hierarchical representation of the scaling matrices to achieve a near-linear complexity \({{\mathscr {O}}}(n \ln ^4 n)\). This provides a fast, scalable, and easy-to-implement algorithm for computing a class of optimal transport dissimilarity measures, enabling their applicability to large-scale optimization problems, where the computation of the classical Wasserstein metric is not feasible. We carry out a rigorous error-complexity analysis for the proposed algorithm and present several numerical examples to verify the accuracy and efficiency of the algorithm and to demonstrate its applicability in tackling real-world problems.
Notes
For example https://github.com/ecrc/hicma.
Ethics declarations
Conflict of interest
Not Applicable.
Additional information
Communicated by Elias Jarlebring.
Appendix
Proof of Lemma 3.1
We first show that for real \(n_k \times n_k\) matrices \(C^{(k)} = [C_{ij}^{(k)}]\), with \(k=1, \dotsc , d\), we have
where \(\exp [\, \cdot \, ]\) denotes element-wise exponentiation, and \(\oplus \) denotes the “all-ones Kronecker sum” operator in Definition 3.1. The case \(d=1\) is trivial. Consider the case \(d=2\). By the definition of the all-ones Kronecker sum, we have
Now assume that it is true for \(d-1\):
Then
The proof of Lemma 3.1 simply follows from (6.1) and (3.1). \(\square \)
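For \(d=2\), the identity (6.1) can be checked numerically. The sketch below assumes the all-ones Kronecker sum reads \(C^{(1)} \oplus C^{(2)} = \mathbf{1}_{n_2} \otimes C^{(1)} + C^{(2)} \otimes \mathbf{1}_{n_1}\), with \(\mathbf{1}_m\) the all-ones \(m \times m\) matrix (our reading of Definition 3.1, which is not reproduced here); all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 3, 4
C1 = rng.random((n1, n1))  # cost matrix along dimension 1
C2 = rng.random((n2, n2))  # cost matrix along dimension 2

# All-ones Kronecker sum (assumed form): entry ((i2,i1),(j2,j1)) of
# C_sum equals C1[i1,j1] + C2[i2,j2].
J1 = np.ones((n1, n1))
J2 = np.ones((n2, n2))
C_sum = np.kron(J2, C1) + np.kron(C2, J1)

# Element-wise exponentiation turns the sum into a Kronecker product
# of the one-dimensional kernels, as in (6.1).
lhs = np.exp(-C_sum)
rhs = np.kron(np.exp(-C2), np.exp(-C1))
assert np.allclose(lhs, rhs)
```

The assertion passes because the element-wise exponential of a sum of costs factorizes entry by entry into a product, which is exactly the Kronecker product structure.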
Proof of Lemma 3.2
We utilize the vectorization operator \(\text {vec}(\cdot )\) that converts \(m\times n\) matrices into \(mn \times 1\) vectors, and denote by \(\mathbf{v} =\text {vec}(V)\) the vector obtained by stacking the columns of matrix V beneath one another. We proceed by induction on the dimension d, and note that by Lemma 3.1 the kernel matrix Q is of the block form (3.3). The case \(d=1\) is trivial, noting that \({\hat{n}}_1 = 1\). Let \(d=2\) and \(Q = Q_2 \otimes Q_1 \in {{\mathbb {R}}}_+^{n \times n}\), with \(Q_1 \in {\mathbb R}_+^{n_1 \times n_1}\), \(Q_2 \in {{\mathbb {R}}}_+^{n_2 \times n_2}\), and \(n = n_1 n_2\). Let further \(V = \text {vec}^{-1}(\mathbf{v}) \in {{\mathbb {R}}}^{n_1 \times n_2}\) be the matrix whose vectorization gives the vector \(\mathbf{v} =\text {vec}(V) \in {{\mathbb {R}}}^{n_1 n_2 \times 1 }\). Then, we have
and computing \(Q_1 V Q_2^{\top }\) can be carried out in two steps. The first step \(W:= V Q_2^{\top }\) requires \(n_1\) multiplications \(Q_2 \tilde{\mathbf{v}}_i\), \(i=1, \dotsc , n_1\), where \(\tilde{\mathbf{v}}_i\) is the i-th row of V in column-vector form. The second step \(Q_1 W\) requires \(n_2\) multiplications \(Q_1 \mathbf{w}_j\), \(j=1, \dotsc , n_2\), where \(\mathbf{w}_j\) is the j-th column of W. Overall, we need \(n_1\) matrix-vector multiplications \(Q_2 \mathbf{z}\) for \(n_1\) different vectors \(\mathbf{z}\), and \(n_2\) matrix-vector multiplications \(Q_1 \mathbf{z}\) for \(n_2\) different vectors \(\mathbf{z}\), as desired. This can be recursively extended to higher dimensions (\(d \ge 3\)) using the associativity of the Kronecker product, \(A \otimes B \otimes C = A \otimes (B \otimes C)\). Setting \({\hat{Q}}_d:=Q_{d-1} \otimes \cdots \otimes Q_1\), we can write
This can again be done in two steps. We first compute \(W:=V Q_d^{\top } \in {{\mathbb {R}}}^{{\hat{n}}_d \times n_d}\), which requires \({\hat{n}}_d\) multiplications \(Q_d \tilde{\mathbf{v}}_i\), where \(\tilde{\mathbf{v}}_i \in {{\mathbb {R}}}^{n_d \times 1}\) is the i-th row of matrix V in the form of a column vector, with \(i=1, \dotsc , {\hat{n}}_d\). We then compute \({\hat{Q}}_d W\), which requires \(n_d\) matrix-vector multiplications \({\hat{Q}}_d \mathbf{w}_j\), where \(\mathbf{w}_j \in {{\mathbb {R}}}^{{\hat{n}}_d \times 1}\) is the j-th column of matrix W, with \(j=1, \dotsc , n_d\). Noting that each multiplication \({\hat{Q}}_d \mathbf{w}_j\) is of the same form as (6.2) but in \(d-1\) dimensions, the lemma follows by induction. \(\square \)
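The recursive two-step strategy above can be sketched as a reshape-and-multiply loop, assuming column-major (Fortran-order) vectorization; the function name `kron_matvec` and the test sizes are illustrative only:

```python
import numpy as np

def kron_matvec(Q_list, v):
    """Compute (Q_d x ... x Q_1) v, with x the Kronecker product,
    without forming the Kronecker product: apply each factor Q_k as a
    mode-k tensor contraction (the two-step strategy of Lemma 3.2,
    unrolled over all d dimensions)."""
    dims = [Q.shape[0] for Q in Q_list]          # (n_1, ..., n_d)
    # Fortran order matches column-major vectorization vec(V).
    T = v.reshape(dims, order="F")
    for k, Q in enumerate(Q_list):
        T = np.tensordot(Q, T, axes=([1], [k]))  # contract mode k
        T = np.moveaxis(T, 0, k)                 # restore axis order
    return T.reshape(-1, order="F")

rng = np.random.default_rng(1)
Qs = [rng.random((n, n)) for n in (2, 3, 4)]     # Q_1, Q_2, Q_3
v = rng.random(2 * 3 * 4)

dense = np.kron(np.kron(Qs[2], Qs[1]), Qs[0])    # Q_3 x Q_2 x Q_1
assert np.allclose(kron_matvec(Qs, v), dense @ v)
```

Each factor \(Q_k\) touches \(n/n_k\) vectors of length \(n_k\), so the total cost is a sum of small matrix-vector products rather than one \(n \times n\) product.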
Proof of Lemma 4.1
The proof follows by induction. For the case \(d=1\), we simply note that
We next consider \(d \ge 2\) and assume that the lemma holds for the case \(d-1\),
where \({\hat{Q}}_d:= Q_{d-1} \otimes \cdots \otimes Q_1\) and \(\mathbf{z} \in {{\mathbb {R}}}^{{\hat{n}}_d}\), with \({\hat{n}}_d = n/n_d\). We will show the lemma also holds for the case d. We first note that
Following the same notation used in the proof of Lemma 3.2 and in Function 1, we denote by \(\tilde{\mathbf{v}}_i\) and \(\tilde{\mathbf{w}}_i\) the i-th rows of matrices V and \(W = V \, Q_d^{\top }\) in column-vector forms, respectively, and by \(\mathbf{v}_j\) and \(\mathbf{w}_j\) the j-th columns of those two matrices. We then follow the two-step strategy in Function 1 for computing \(Q \mathbf{v}\). The first step is to approximate \(\tilde{\mathbf{w}}_i := Q_d \, \tilde{\mathbf{v}}_i\) by \(\tilde{\mathbf{w}}_i^{\text {H}} := Q_d^{\text {H}} \, \tilde{\mathbf{v}}_i\). This gives us
and
The second step is to approximate \(\mathbf{s}_j := {\hat{Q}}_d \, \mathbf{w}_j\) by \(\mathbf{s}_j^{\text {H}} := {\hat{Q}}_d^{\text {H}} \, \mathbf{w}_j^{\text {H}}\). This gives us, by triangle inequality,
The first term on the right-hand side of the above inequality can be bounded using the compatibility and submultiplicativity of the Frobenius norm, and the second term can be bounded by the inductive hypothesis (6.3), as follows
In order to use the estimates (6.4)–(6.5) in the above inequality, we first note that if for three nonnegative numbers (a, b, c) we have \(a \le b+c\), then \(a^2 \le 2 b^2 + 2 c^2\). We hence write
We then utilize the connection between the \(L^2\)-norm of rows \(\tilde{\mathbf{w}}_i\) and columns \(\mathbf{w}_j\) of a matrix
and, thanks to (6.4)–(6.5), write
Setting \(\mathbf{s}:= Q \mathbf{v}\) and \(\mathbf{s}^{\text {H}}:= Q^{\text {H}} \mathbf{v}\), and noting that \(|| \mathbf{s} - \mathbf{s}^{\text {H}} ||_2^2 = \sum _j || \mathbf{s}_j - \mathbf{s}_j^{\text {H}} ||_2^2\) and that \(|| \mathbf{v} ||_2^2 = \sum _i || \tilde{\mathbf{v}}_i ||_2^2\), we obtain
It remains to note that the solution of the recursive equation \(h(d)^2 = 2 (1 + h(d-1)^2)\) with \(h(1)=1\) is given by \(h(d) = (2^d + 2^{d-1}-2)^{1/2}\), as desired. \(\square \)
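The closed form for \(h(d)\) can be verified against the recursion directly; the short check below is purely illustrative:

```python
import math

# Recursion h(d)^2 = 2 (1 + h(d-1)^2) with h(1) = 1, against the
# claimed closed form h(d) = (2^d + 2^(d-1) - 2)^(1/2).
h = 1.0  # h(1)
for d in range(2, 11):
    h = math.sqrt(2.0 * (1.0 + h * h))
    closed = math.sqrt(2 ** d + 2 ** (d - 1) - 2)
    assert math.isclose(h, closed)
```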
Proof of Lemma 4.2
By (2.13) and (2.12), we first write
where in the second inequality, we have assumed without loss of generality that the approximation error in the entropic optimal cost of \((\mathbf{f}_n, \mathbf{g}_n)\) is larger than or equal to that of \((\mathbf{f}_n, \mathbf{f}_n)\) and \((\mathbf{g}_n, \mathbf{g}_n)\). By the Cauchy–Schwarz inequality \(|\mathbf{v}^{\top } \mathbf{w}| \le || \mathbf{v} ||_2 \, || \mathbf{w} ||_2\), we then get
We next find an upper bound for \(|| \ln ( \mathbf{u}^{(K)} \oslash \mathbf{u}^{H (K)}) ||_2\). By the first Sinkhorn formula (2.10), we have
Then, by the logarithm inequality \(\ln (1+\beta ) \le \beta \), for \(\beta > -1\), we get
with \(\omega >0\) as in the statement of the lemma. Hence,
where the second inequality is the triangle inequality, and the third inequality follows from Lemma 4.1 and (4.5), together with the compatibility of the Frobenius matrix norm with the Euclidean vector norm, \(||A \mathbf{v} ||_2 \le ||A||_{\text {F}} ||\mathbf{v}||_2\). To further bound the second term of the above inequality, we use the second Sinkhorn formula (2.10) and write
Hence, noting that \(\max \mathbf{g}_n \le 1\), we obtain
By the triangle inequality and Lemma 4.1 and (4.5), we then get
Using the logarithm inequality \(1-1/\beta \le \ln \beta \) for \(\beta >0\), and setting \(\beta \) to be the elements of the vector \(\mathbf{u}^{(i-1)} \oslash \mathbf{u}^{H (i-1)}\), we have
where we have used \(\max \mathbf{u}^{(i-1)} \le 1 / \omega \), which follows by (2.10) and since \(\max \mathbf{f}_n \le 1\). Noting that \(|| \mathbf{u}^{(i-1)} ||_2 \le n / \omega \), the last two inequalities above therefore give us
Now, setting
by (6.7) and (6.8), and noting that \(|| \mathbf{v}^{(i-1)} ||_2 \le n / \omega \), we obtain the recursive formula
where \(c_1\) is a constant that depends only on \(\omega \) and d. Noting that \(\delta _\mathbf{u}^{(0)} = 0\), we finally obtain
where \(c_2\) is a constant that depends only on \(c_1\). Finally, by the sum of a finite geometric series, we get
where \(c_3\) is a constant that depends only on \(c_2\), that is, on \(\omega \) and d. The desired estimate (4.6) follows by (6.6) and (6.9), and noting that a similar estimate as (6.9) holds for \(\delta _\mathbf{v}^{(K)} := || \ln ( \mathbf{v}^{(K)} \oslash \mathbf{v}^{H (K)}) ||_2\). \(\square \)
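The proof above repeatedly invokes the two Sinkhorn scaling formulas (2.10). As a self-contained illustration, here is the standard Sinkhorn matrix-scaling iteration for a generic positive kernel; this is a sketch under the assumption that (2.10) has the usual form \(\mathbf{u} = \mathbf{f} \oslash (Q\mathbf{v})\), \(\mathbf{v} = \mathbf{g} \oslash (Q^{\top }\mathbf{u})\), and all names and sizes below are illustrative:

```python
import numpy as np

def sinkhorn_scalings(Q, f, g, K):
    """Standard Sinkhorn matrix-scaling iteration for a positive
    kernel Q and marginals (f, g); returns the scaling vectors
    (u, v) after K sweeps. (Assumed form of the paper's (2.10).)"""
    u = np.ones(len(f))
    v = np.ones(len(g))
    for _ in range(K):
        u = f / (Q @ v)       # first scaling formula
        v = g / (Q.T @ u)     # second scaling formula
    return u, v

# Tiny example: uniform marginals and a Gaussian kernel on [0, 1].
n = 8
x = np.linspace(0.0, 1.0, n)
Q = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.5)
f = np.full(n, 1.0 / n)
g = np.full(n, 1.0 / n)
u, v = sinkhorn_scalings(Q, f, g, K=200)

# The scaled plan P = diag(u) Q diag(v) matches the target marginals.
P = u[:, None] * Q * v[None, :]
assert np.allclose(P.sum(axis=1), f, atol=1e-8)
assert np.allclose(P.sum(axis=0), g, atol=1e-8)
```

In the hierarchical algorithm analyzed above, the exact products \(Q\mathbf{v}\) and \(Q^{\top }\mathbf{u}\) are replaced by their low-rank approximations, which is precisely the perturbation the lemma quantifies.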
About this article
Cite this article
Motamed, M. A hierarchically low-rank optimal transport dissimilarity measure for structured data. BIT Numer. Math. 62, 1945–1982 (2022). https://doi.org/10.1007/s10543-022-00937-9
Keywords
- Optimal transport
- Wasserstein metric
- Sinkhorn divergence
- Entropic regularization
- Hierarchical matrices
- Optimization problems