
A hierarchically low-rank optimal transport dissimilarity measure for structured data


Abstract

We develop a class of hierarchically low-rank, scalable optimal transport dissimilarity measures for structured data, improving the performance of current state-of-the-art optimal transport solvers. Given two n-dimensional discrete probability measures supported on two structured grids in \({\mathbb {R}}^d\), we present a fast method for computing an entropically regularized optimal transport distance, referred to as the debiased Sinkhorn distance. The method combines Sinkhorn’s matrix scaling iteration with a low-rank hierarchical representation of the scaling matrices to achieve a near-linear complexity \({{\mathscr {O}}}(n \ln ^4 n)\). This provides a fast, scalable, and easy-to-implement algorithm for computing a class of optimal transport dissimilarity measures, making them applicable to large-scale optimization problems where computing the classical Wasserstein metric is not feasible. We carry out a rigorous error-complexity analysis for the proposed algorithm and present several numerical examples to verify the accuracy and efficiency of the algorithm and to demonstrate its applicability in tackling real-world problems.
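To make the construction concrete, the following is a minimal Python sketch of the debiased Sinkhorn computation. It is an illustration added to this write-up, not the paper's implementation: it assumes the standard scaling updates \(\mathbf{u} = \mathbf{f}_n \oslash (Q \mathbf{v})\), \(\mathbf{v} = \mathbf{g}_n \oslash (Q^{\top } \mathbf{u})\) used in (2.10) and the debiasing combination \(T(\mathbf{f},\mathbf{g}) - 0.5\, T(\mathbf{f},\mathbf{f}) - 0.5\, T(\mathbf{g},\mathbf{g})\) that appears in the proof of Lemma 4.2, while `transport_cost` is a hypothetical stand-in that evaluates the primal cost \(\langle P, C \rangle\) of the K-step plan rather than the paper's exact definition (2.12), and a dense kernel replaces the hierarchical low-rank representation.

```python
import numpy as np

def sinkhorn_scalings(f, g, Q, K):
    # K steps of the Sinkhorn scaling iteration u = f ./ (Q v), v = g ./ (Q^T u)
    # (the updates used in (2.10) of the paper), started from v = ones.
    v = np.ones_like(g)
    for _ in range(K):
        u = f / (Q @ v)
        v = g / (Q.T @ u)
    return u, v

def transport_cost(f, g, C, eps, K):
    # Hypothetical stand-in for the entropic cost T_{eps,K}(f, g): the primal
    # cost <P, C> of the K-step plan P = diag(u) Q diag(v) with Q = exp(-C/eps).
    # The paper's exact definition (2.12) is not reproduced here, and the dense
    # kernel Q stands in for the hierarchical low-rank representation.
    Q = np.exp(-C / eps)
    u, v = sinkhorn_scalings(f, g, Q, K)
    P = u[:, None] * Q * v[None, :]
    return np.sum(P * C)

def debiased_sinkhorn(f, g, C_fg, C_ff, C_gg, eps, K):
    # Debiasing combination SD^p = T(f,g) - 0.5 T(f,f) - 0.5 T(g,g),
    # as used in the proof of Lemma 4.2.
    return (transport_cost(f, g, C_fg, eps, K)
            - 0.5 * transport_cost(f, f, C_ff, eps, K)
            - 0.5 * transport_cost(g, g, C_gg, eps, K))

# Toy usage: two Gaussian-like measures on a 1-D grid with squared-distance cost.
x = np.linspace(0.0, 1.0, 50)
C = (x[:, None] - x[None, :]) ** 2
f = np.exp(-(x - 0.3) ** 2 / 0.01); f /= f.sum()
g = np.exp(-(x - 0.7) ** 2 / 0.01); g /= g.sum()
print(debiased_sinkhorn(f, g, C, C, C, eps=0.01, K=200))
```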


Notes

  1. For example https://github.com/ecrc/hicma.


Author information

Corresponding author

Correspondence to Mohammad Motamed.

Ethics declarations

Conflict of interest

Not Applicable.

Additional information

Communicated by Elias Jarlebring.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Proof of Lemma 3.1

We first show that for real \(n_k \times n_k\) matrices \(C^{(k)} = [C_{ij}^{(k)}]\), with \(k=1, \dotsc , d\), we have

$$\begin{aligned} \exp [C^{(d)} \oplus \cdots \oplus C^{(1)}] = \exp [C^{(d)}] \otimes \cdots \otimes \exp [C^{(1)}], \end{aligned}$$
(6.1)

where \(\exp [\, \cdot \, ]\) denotes element-wise exponentiation, and \(\oplus \) denotes the “all-ones Kronecker sum” operator in Definition 3.1. The case \(d=1\) is trivial. Consider the case \(d=2\). By the definition of the all-ones Kronecker sum, we have

$$\begin{aligned} \exp [C^{(2)} \oplus C^{(1)}]&= \exp [C^{(2)} \otimes J_1 + J_2 \otimes C^{(1)}] \\&= \left( \begin{array}{ccc} \exp (C_{11}^{(2)}) \, J_1 &{} \cdots &{} \exp (C_{1n_2}^{(2)}) \, J_1\\ \vdots &{} \ddots &{} \vdots \\ \exp (C_{n_2 1}^{(2)}) \, J_1 &{} \cdots &{} \exp (C_{n_2 n_2}^{(2)}) \, J_1 \end{array}\right) \odot \left( \begin{array}{ccc} \exp [C^{(1)}] &{} \cdots &{} \exp [C^{(1)}]\\ \vdots &{} \ddots &{} \vdots \\ \exp [C^{(1)}] &{} \cdots &{} \exp [C^{(1)}] \end{array}\right) \\&= \left( \begin{array}{ccc} \exp (C_{11}^{(2)}) \exp [C^{(1)}] &{} \cdots &{} \exp (C_{1n_2}^{(2)}) \exp [C^{(1)}]\\ \vdots &{} \ddots &{} \vdots \\ \exp (C_{n_2 1}^{(2)}) \exp [C^{(1)}] &{} \cdots &{} \exp (C_{n_2 n_2}^{(2)}) \exp [C^{(1)}] \end{array}\right) \\&= \exp [C^{(2)}] \otimes \exp [C^{(1)}]. \end{aligned}$$

Now assume that it is true for \(d-1\):

$$\begin{aligned} \exp [C^{(d-1)} \oplus \cdots \oplus C^{(1)}] = \exp [C^{(d-1)}] \otimes \cdots \otimes \exp [C^{(1)}]. \end{aligned}$$

Then

$$\begin{aligned}&\exp [C^{(d)} \oplus \cdots \oplus C^{(1)}] = \exp [C^{(d)}] \otimes \exp [C^{(d-1)} \oplus \cdots \oplus C^{(1)}]\\&\quad = \exp [C^{(d)}] \otimes \cdots \otimes \exp [C^{(1)}]. \end{aligned}$$

The proof of Lemma 3.1 simply follows from (6.1) and (3.1). \(\square \)
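For a quick numerical confirmation of the identity (6.1), the following small NumPy check (an illustration added here, not part of the original proof) verifies the case \(d=2\) with random matrices, where \(J_k\) is the all-ones matrix appearing in the all-ones Kronecker sum of Definition 3.1.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 3, 4
C1, C2 = rng.standard_normal((n1, n1)), rng.standard_normal((n2, n2))
J1, J2 = np.ones((n1, n1)), np.ones((n2, n2))

# all-ones Kronecker sum: C^(2) (+) C^(1) = C^(2) x J_1 + J_2 x C^(1)
kron_sum = np.kron(C2, J1) + np.kron(J2, C1)

# identity (6.1): element-wise exponentiation factors into a Kronecker product
assert np.allclose(np.exp(kron_sum), np.kron(np.exp(C2), np.exp(C1)))
```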

Proof of Lemma 3.2

We utilize the vectorization operator \(\text {vec}(.)\) that converts \(m\times n\) matrices into \(mn \times 1\) vectors, and denote by \(\mathbf{v} =\text {vec}(V)\) the vector obtained by stacking the columns of matrix V beneath one another. We proceed by induction on the dimension d, and note that by Lemma 3.1 the kernel matrix Q is of the block form (3.3). The case \(d=1\) is trivial, noting that \({\hat{n}}_1 = 1\). Let \(d=2\) and \(Q = Q_2 \otimes Q_1 \in {{\mathbb {R}}}_+^{n \times n}\), with \(Q_1 \in {\mathbb R}_+^{n_1 \times n_1}\), \(Q_2 \in {{\mathbb {R}}}_+^{n_2 \times n_2}\), and \(n = n_1 n_2\). Let further \(V = \text {vec}^{-1}(\mathbf{v}) \in {{\mathbb {R}}}^{n_1 \times n_2}\) be the matrix whose vectorization gives the vector \(\mathbf{v} =\text {vec}(V) \in {{\mathbb {R}}}^{n_1 n_2 \times 1 }\). Then, we have

$$\begin{aligned} Q \mathbf{v} = (Q_2 \otimes Q_1) \, {\text {vec}}(V) = \text {vec}(Q_1 V Q_2^{\top }), \end{aligned}$$

and computing \(Q_1 V Q_2^{\top }\) can be carried out in two steps. The first step \(W:= V Q_2^{\top }\) requires \(n_1\) multiplications \(Q_2 \tilde{\mathbf{v}}_i\), \(i=1, \dotsc , n_1\), where \(\tilde{\mathbf{v}}_i\) is the i-th row of V in column-vector form. The second step \(Q_1 W\) requires \(n_2\) multiplications \(Q_1 \mathbf{w}_j\), \(j=1, \dotsc , n_2\), where \(\mathbf{w}_j\) is the j-th column of W. Overall, we need \(n_1\) matrix-vector multiplications \(Q_2 \mathbf{z}\) for \(n_1\) different vectors \(\mathbf{z}\), and \(n_2\) matrix-vector multiplications \(Q_1 \mathbf{z}\) for \(n_2\) different vectors \(\mathbf{z}\), as desired. This can be recursively extended to higher dimensions (\(d \ge 3\)), using the associative property of the Kronecker product \(A \otimes B \otimes C = A \otimes (B \otimes C)\). Setting \({\hat{Q}}_d:=Q_{d-1} \otimes \cdots \otimes Q_1\), we can write

$$\begin{aligned} Q \mathbf{v} = (Q_d \otimes Q_{d-1} \otimes \cdots \otimes Q_1) \mathbf{v} = (Q_d \otimes {\hat{Q}}_d) \, {\text {vec}}(V) = \text {vec}({\hat{Q}}_d V Q_d^{\top }). \end{aligned}$$
(6.2)

This can again be done in two steps. We first compute \(W:=V Q_d^{\top } \in {{\mathbb {R}}}^{{\hat{n}}_d \times n_d}\), which requires \({\hat{n}}_d\) multiplications \(Q_d \tilde{\mathbf{v}}_i\), where \(\tilde{\mathbf{v}}_i \in {{\mathbb {R}}}^{n_d \times 1}\) is the i-th row of matrix V in the form of a column vector, with \(i=1, \dotsc , {\hat{n}}_d\). We then compute \({\hat{Q}}_d W\), which requires \(n_d\) matrix-vector multiplications \({\hat{Q}}_d \mathbf{w}_j\), where \(\mathbf{w}_j \in {\mathbb R}^{{\hat{n}}_d \times 1}\) is the j-th column of matrix W, with \(j=1, \dotsc , n_d\). Noting that each multiplication \({\hat{Q}}_d \mathbf{w}_j\) is of the same form as (6.2) but in \(d-1\) dimensions, the lemma follows easily by induction. \(\square \)
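To illustrate the two-step strategy above, here is a minimal NumPy sketch (a hypothetical helper added for illustration, not the paper's Function 1) that applies \(Q = Q_d \otimes \cdots \otimes Q_1\) to a vector recursively via (6.2), using only matrix-vector products with the small factors, and checks the result against the dense Kronecker product.

```python
import numpy as np

def kron_matvec(factors, v):
    # factors = [Q_1, ..., Q_d]; computes (Q_d x ... x Q_1) v via
    # (Q_d x Qhat_d) vec(V) = vec(Qhat_d V Q_d^T), cf. (6.2).
    if len(factors) == 1:
        return factors[0] @ v
    Qd = factors[-1]
    nd = Qd.shape[1]
    V = v.reshape((v.size // nd, nd), order="F")   # V = vec^{-1}(v)
    W = V @ Qd.T                                   # step 1: Q_d applied to the rows of V
    cols = [kron_matvec(factors[:-1], W[:, j]) for j in range(nd)]  # step 2, recursive
    return np.column_stack(cols).reshape(-1, order="F")             # vec(Qhat_d W)

# check against the dense Kronecker product for d = 3
rng = np.random.default_rng(0)
Q1, Q2, Q3 = rng.random((3, 3)), rng.random((4, 4)), rng.random((2, 2))
v = rng.random(3 * 4 * 2)
dense = np.kron(np.kron(Q3, Q2), Q1)
assert np.allclose(dense @ v, kron_matvec([Q1, Q2, Q3], v))
```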

Proof of Lemma 4.1

The proof follows by induction. For the case \(d=1\), we simply note that

$$\begin{aligned} || Q \mathbf{v} - Q^{\text {H}} \mathbf{v} ||_2 = || (Q - Q^{\text {H}}) \mathbf{v} ||_2 \le || Q - Q^{\text {H}} ||_{F} \, || \mathbf{v} ||_2. \end{aligned}$$

We next consider \(d \ge 2\) and assume that the lemma holds for the case \(d-1\),

$$\begin{aligned} || {\hat{Q}}_d \, \mathbf{z} - {\hat{Q}}_d^{\text {H}} \, \mathbf{z} ||_2 \le h(d-1) \, \max _{k=1, \dotsc , d-1} || Q_k||_F^{d-2} \, \max _{k=1, \dotsc , d-1} || Q_k - Q_k^{\text {H}}||_F \, ||\mathbf{z} ||_2,\nonumber \\ \end{aligned}$$
(6.3)

where \({\hat{Q}}_d:= Q_{d-1} \otimes \cdots \otimes Q_1\) and \(\mathbf{z} \in {{\mathbb {R}}}^{{\hat{n}}_d}\), with \({\hat{n}}_d = n/n_d\). We will show the lemma also holds for the case d. We first note that

$$\begin{aligned} Q \mathbf{v} = \text {vec}({\hat{Q}}_d \, V \, Q_d^{\top }), \qquad V:= \text {vec}^{-1}(\mathbf{v}). \end{aligned}$$

Following the same notation used in the proof of Lemma 3.2 and in Function 1, we denote by \(\tilde{\mathbf{v}}_i\) and \(\tilde{\mathbf{w}}_i\) the i-th rows of matrices V and \(W = V \, Q_d^{\top }\) in column-vector forms, respectively, and by \(\mathbf{v}_j\) and \(\mathbf{w}_j\) the j-th columns of those two matrices. We then follow the two-step strategy in Function 1 for computing \(Q \mathbf{v}\). The first step is to approximate \(\tilde{\mathbf{w}}_i := Q_d \, \tilde{\mathbf{v}}_i\) by \(\tilde{\mathbf{w}}_i^{\text {H}} := Q_d^{\text {H}} \, \tilde{\mathbf{v}}_i\). This gives us

$$\begin{aligned} || \tilde{\mathbf{w}}_i - \tilde{\mathbf{w}}_i^{\text {H}} ||_2 = || (Q_d - Q_d^{\text {H}}) \tilde{\mathbf{v}}_i ||_2 \le || Q_d - Q_d^{\text {H}} ||_{F} \, || \tilde{\mathbf{v}}_i ||_2, \end{aligned}$$
(6.4)

and

$$\begin{aligned} || \tilde{\mathbf{w}}_i^{\text {H}} ||_2 \le || Q_d ||_F \, || \tilde{\mathbf{v}}_i ||_2. \end{aligned}$$
(6.5)

The second step is to approximate \(\mathbf{s}_j := {\hat{Q}}_d \, \mathbf{w}_j\) by \(\mathbf{s}_j^{\text {H}} := {\hat{Q}}_d^{\text {H}} \, \mathbf{w}_j^{\text {H}}\). This gives us, by triangle inequality,

$$\begin{aligned} || \mathbf{s}_j - \mathbf{s}_j^{\text {H}} ||_2 = || {\hat{Q}}_d \, \mathbf{w}_j - {\hat{Q}}_d^{\text {H}} \, \mathbf{w}_j^{\text {H}}||_2 \le || {\hat{Q}}_d \, (\mathbf{w}_j - \mathbf{w}_j^{\text {H}}) ||_2 + || ({\hat{Q}}_d - {\hat{Q}}_d^{\text {H}}) \, \mathbf{w}_j^{\text {H}} ||_2 \end{aligned}$$

The first term on the right-hand side of the above inequality can be bounded using the compatibility and submultiplicativity of the Frobenius norm, and the second term can be bounded by the inductive hypothesis (6.3), as follows:

$$\begin{aligned}&|| \mathbf{s}_j - \mathbf{s}_j^{\text {H}} ||_2 \le \max _{k=1, \dotsc , d-1} || Q_k ||_F^{d-1} \, || \mathbf{w}_j - \mathbf{w}_j^{\text {H}} ||_2 + h(d-1) \, \max _{k=1, \dotsc , d-1} || Q_k||_F^{d-2} \,\\&\quad \max _{k=1, \dotsc , d-1} || Q_k - Q_k^{\text {H}}||_F \, ||\mathbf{w}_j^{\text {H}} ||_2. \end{aligned}$$

In order to use the estimates (6.4)–(6.5) in the above inequality, we first note that if for three nonnegative numbers \(a, b, c\) we have \(a \le b+c\), then \(a^2 \le 2 b^2 + 2 c^2\). We hence write

$$\begin{aligned}&|| \mathbf{s}_j - \mathbf{s}_j^{\text {H}} ||_2^2 \le 2 \, \max _{k=1, \dotsc , d-1} || Q_k ||_F^{2(d-1)} \, || \mathbf{w}_j - \mathbf{w}_j^{\text {H}} ||_2^2 \\&\quad + 2 \, h(d-1)^2 \, \max _{k=1, \dotsc , d-1} || Q_k||_F^{2(d-2)} \, \max _{k=1, \dotsc , d-1} || Q_k - Q_k^{\text {H}}||_F^2 \, ||\mathbf{w}_j^{\text {H}} ||_2^2. \end{aligned}$$

We then utilize the connection between the \(L^2\)-norm of rows \(\tilde{\mathbf{w}}_i\) and columns \(\mathbf{w}_j\) of a matrix

$$\begin{aligned} \sum _{i} || \tilde{\mathbf{w}}_i ||_2^2 = \sum _j || \mathbf{w}_j ||_2^2, \end{aligned}$$

and, thanks to (6.4)–(6.5), write

$$\begin{aligned}&\sum _j || \mathbf{s}_j - \mathbf{s}_j^{\text {H}} ||_2^2 \le 2 \, \max _{k=1, \dotsc , d-1} || Q_k ||_F^{2(d-1)} \, || Q_d - Q_d^{\text {H}} ||_{F}^2 \, \sum _i || \tilde{\mathbf{v}}_i ||_2^2 \\&\quad + 2 \, h(d-1)^2 \, \max _{k=1, \dotsc , d-1} || Q_k||_F^{2(d-2)} \, \max _{k=1, \dotsc , d-1} || Q_k - Q_k^{\text {H}}||_F^2 \, || Q_d ||_F^2 \, \sum _i || \tilde{\mathbf{v}}_i ||_2^2. \end{aligned}$$

Setting \(\mathbf{s}:= Q \mathbf{v}\) and \(\mathbf{s}^{\text {H}}:= Q^{\text {H}} \mathbf{v}\), and noting that \(|| \mathbf{s} - \mathbf{s}^{\text {H}} ||_2^2 = \sum _j || \mathbf{s}_j - \mathbf{s}_j^{\text {H}} ||_2^2\) and that \(|| \mathbf{v} ||_2^2 = \sum _i || \tilde{\mathbf{v}}_i ||_2^2\), we obtain

$$\begin{aligned} || \mathbf{s} - \mathbf{s}^{\text {H}} ||_2^2 \le 2 (1 + h(d-1)^2) \, \max _{k=1, \dotsc , d} || Q_k ||_F^{2(d-1)} \, \max _{k=1, \dotsc , d} || Q_k - Q_k^{\text {H}}||_F^2 \, || \mathbf{v} ||_2^2. \end{aligned}$$

It remains to note that the solution of the recursive equation \(h(d)^2 = 2 (1 + h(d-1)^2)\) with \(h(1)=1\) is given by \(h(d) = (2^d + 2^{d-1}-2)^{1/2}\), as desired. \(\square \)
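As a quick sanity check (added here; not part of the proof), the closed form \(h(d) = (2^d + 2^{d-1}-2)^{1/2}\) can be verified against the recursion \(h(d)^2 = 2 (1 + h(d-1)^2)\), \(h(1)=1\), numerically:

```python
import numpy as np

h = 1.0  # h(1) = 1
for d in range(2, 12):
    h = np.sqrt(2.0 * (1.0 + h ** 2))                               # recursion
    assert np.isclose(h, np.sqrt(2.0 ** d + 2.0 ** (d - 1) - 2.0))  # closed form
```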

Proof of Lemma 4.2

By (2.13) and (2.12), we first write

$$\begin{aligned} \varepsilon _{II}&= |SD_{\varepsilon , K}(\mathbf{f}_n, \mathbf{g}_n)^p - SD_{\varepsilon , K}^{\text {H}}(\mathbf{f}_n, \mathbf{g}_n)^p | \\&\le |T_{\varepsilon , K}(\mathbf{f}_n, \mathbf{g}_n) - T_{\varepsilon , K}^{H}(\mathbf{f}_n, \mathbf{g}_n) | \\&\quad +0.5 \, |T_{\varepsilon , K}(\mathbf{f}_n, \mathbf{f}_n) - T_{\varepsilon , K}^{H} (\mathbf{f}_n, \mathbf{f}_n) | + 0.5 \, | T_{\varepsilon , K}(\mathbf{g}_n, \mathbf{g}_n) - T_{\varepsilon , K}^{H} (\mathbf{g}_n, \mathbf{g}_n)| \\&\le 2 \, |T_{\varepsilon , K}(\mathbf{f}_n, \mathbf{g}_n) - T_{\varepsilon , K}^{H}(\mathbf{f}_n, \mathbf{g}_n) | \\&\le 2 \, \varepsilon \, \left( | \mathbf{f}_n^{\top } (\ln \mathbf{u}^{(K)} -\ln \mathbf{u}^{H (K)}) | + | \mathbf{g}_n^{\top } (\ln \mathbf{v}^{(K)} -\ln \mathbf{v}^{H (K)}) | \right) \\&= 2 \, \varepsilon \, \left( | \mathbf{f}_n^{\top } \, \ln ( \mathbf{u}^{(K)} \oslash \mathbf{u}^{H (K)}) | + | \mathbf{g}_n^{\top } \, \ln ( \mathbf{v}^{(K)} \oslash \mathbf{v}^{H (K)}) | \right) , \end{aligned}$$

where, in the second inequality, we have assumed without loss of generality that the approximation error in the entropic optimal cost of \((\mathbf{f}_n, \mathbf{g}_n)\) is larger than or equal to that of \((\mathbf{f}_n, \mathbf{f}_n)\) and \((\mathbf{g}_n, \mathbf{g}_n)\). By the Cauchy–Schwarz inequality \(|\mathbf{v}^{\top } \mathbf{w}| \le || \mathbf{v} ||_2 \, || \mathbf{w} ||_2\), we then get

$$\begin{aligned} \varepsilon _{II} \le 2 \, \varepsilon \, \Bigl ( || \mathbf{f}_n ||_2 \, || \ln ( \mathbf{u}^{(K)} \oslash \mathbf{u}^{H (K)}) ||_2 + || \mathbf{g}_n ||_2 \, || \ln ( \mathbf{v}^{(K)} \oslash \mathbf{v}^{H (K)}) ||_2 \Bigr ). \end{aligned}$$
(6.6)

We next find an upper bound for \(|| \ln ( \mathbf{u}^{(K)} \oslash \mathbf{u}^{H (K)}) ||_2\). By the first Sinkhorn formula in (2.10), we have

$$\begin{aligned}&\mathbf{u}^{(i)} \oslash \mathbf{u}^{H (i)} = \bigl ( \mathbf{f}_n \, \oslash \, (Q \, \mathbf{v}^{(i-1)}) \bigr ) \oslash \bigl ( \mathbf{f}_n \, \oslash \, (Q^H \, \mathbf{v}^{H (i-1)}) \bigr ) = {\mathbb {1}}_n \\&+ \bigl ( Q^H \, \mathbf{v}^{H (i-1)} - Q \, \mathbf{v}^{(i-1)}\bigr ) \oslash (Q \, \mathbf{v}^{(i-1)}). \end{aligned}$$

Then, by the logarithm inequality \(\ln (1+\beta ) \le \beta \), for \(\beta > -1\), we get

$$\begin{aligned} | \ln ( \mathbf{u}^{(i)} \oslash \mathbf{u}^{H (i)} ) | \le | Q^H \, \mathbf{v}^{H (i-1)} - Q \, \mathbf{v}^{(i-1)} | \oslash | Q \, \mathbf{v}^{(i-1)}| \le \frac{1}{\omega } \, | Q^H \, \mathbf{v}^{H (i-1)} - Q \, \mathbf{v}^{(i-1)} |, \end{aligned}$$

with \(\omega >0\) as in the statement of the lemma. Hence,

$$\begin{aligned} || \ln ( \mathbf{u}^{(i)} \oslash \mathbf{u}^{H (i)} ) ||_2&\le \frac{1}{\omega } \, || Q \, \mathbf{v}^{(i-1)} - Q^H \, \mathbf{v}^{H (i-1)} ||_2 \nonumber \\&\le \frac{1}{\omega } \, || Q \, \mathbf{v}^{(i-1)} - Q^H \, \mathbf{v}^{(i-1)} ||_2 + \frac{1}{\omega } \, || Q^H \, \mathbf{v}^{(i-1)} - Q^H \, \mathbf{v}^{H (i-1)} ||_2 \nonumber \\&\le \frac{1}{\omega } \, h(d) \, || \mathbf{v}^{(i-1)} ||_2 \, \hat{\varepsilon } + \frac{1}{\omega } \, || Q^H ||_F \, || \mathbf{v}^{(i-1)} - \mathbf{v}^{H (i-1)} ||_2, \end{aligned}$$
(6.7)

where the second inequality is the triangle inequality, and the third inequality follows from Lemma 4.1 and (4.5) and the compatibility of the Frobenius matrix norm with the Euclidean vector norm, \(||A \mathbf{v} ||_2 \le ||A||_{\text {F}} ||\mathbf{v}||_2\). To further bound the second term of the above inequality, we use the second Sinkhorn formula in (2.10) and write

$$\begin{aligned} \mathbf{v}^{(i-1)} - \mathbf{v}^{H (i-1)}&= \mathbf{g}_n \, \oslash \, (Q^{\top } \, \mathbf{u}^{(i-1)}) - \mathbf{g}_n \, \oslash \, (Q^{H^{\top }} \, \mathbf{u}^{H (i-1)})\\&= \Bigl ( \mathbf{g}_n \, \oslash \bigl ( (Q^{\top } \, \mathbf{u}^{(i-1)}) \odot (Q^{H^{\top }} \, \mathbf{u}^{H (i-1)}) \bigr ) \Bigr ) \odot (Q^{\top } \, \mathbf{u}^{(i-1)}\\&\quad - Q^{H^{\top }} \, \mathbf{u}^{H (i-1)}). \end{aligned}$$

Hence, noting that \(\max \mathbf{g}_n \le 1\), we obtain

$$\begin{aligned} | \mathbf{v}^{(i-1)} - \mathbf{v}^{H (i-1)} | \le \frac{1}{\omega ^2} \, | Q^{\top } \, \mathbf{u}^{(i-1)} - Q^{H^{\top }} \, \mathbf{u}^{H (i-1)}|. \end{aligned}$$

By the triangle inequality and Lemma 4.1 and (4.5), we then get

$$\begin{aligned} || \mathbf{v}^{(i-1)} - \mathbf{v}^{H (i-1)} ||_2&\le \frac{1}{\omega ^2} \, || Q^{\top } \, \mathbf{u}^{(i-1)} - Q^{H^{\top }} \, \mathbf{u}^{H(i-1)} ||_2 \\&\le \frac{1}{\omega ^2} \, || Q^{\top } \, \mathbf{u}^{(i-1)} - Q^{H^{\top }} \, \mathbf{u}^{(i-1)} ||_2 + \frac{1}{\omega ^2} \, || Q^{H^{\top }} \, \mathbf{u}^{(i-1)} - Q^{H^{\top }} \, \mathbf{u}^{H (i-1)} ||_2 \\&\le \frac{1}{\omega ^2} \, h(d) \, || \mathbf{u}^{(i-1)} ||_2 \, \hat{\varepsilon } + \frac{1}{\omega ^2} \, || Q^H ||_F \, || \mathbf{u}^{(i-1)} - \mathbf{u}^{H (i-1)} ||_2. \end{aligned}$$

Using the logarithm inequality \(1-1/\beta \le \ln \beta \) for \(\beta >0\), and setting \(\beta \) to be the elements of the vector \(\mathbf{u}^{(i-1)} \oslash \mathbf{u}^{H (i-1)}\), we have

$$\begin{aligned} || \mathbf{u}^{(i-1)} - \mathbf{u}^{H (i-1)} ||_2\le & {} \max \mathbf{u}^{(i-1)} \, || \ln ( \mathbf{u}^{(i-1)} \oslash \mathbf{u}^{H (i-1)} ) ||_2\\\le & {} \frac{1}{\omega } \, || \ln ( \mathbf{u}^{(i-1)} \oslash \mathbf{u}^{H (i-1)} ) ||_2, \end{aligned}$$

where we have used \(\max \mathbf{u}^{(i-1)} \le 1 / \omega \), which follows by (2.10) and since \(\max \mathbf{f}_n \le 1\). Noting that \(|| \mathbf{u}^{(i-1)} ||_2 \le n / \omega \), the last two inequalities above therefore give us

$$\begin{aligned} || \mathbf{v}^{(i-1)} - \mathbf{v}^{H (i-1)} ||_2 \le \frac{n}{\omega ^3} \, h(d) \, \hat{\varepsilon } + \frac{1}{\omega ^3} \, || Q^H ||_F \, || \ln ( \mathbf{u}^{(i-1)} \oslash \mathbf{u}^{H (i-1)} ) ||_2. \end{aligned}$$
(6.8)

Now, setting

$$\begin{aligned} \delta _\mathbf{u}^{(i)} := || \ln ( \mathbf{u}^{(i)} \oslash \mathbf{u}^{H (i)} ) ||_2, \end{aligned}$$

by (6.7) and (6.8), and noting that \(|| \mathbf{v}^{(i-1)} ||_2 \le n / \omega \), we obtain the recursive formula

$$\begin{aligned} \delta _\mathbf{u}^{(i)} \le c_1 \, \Bigl ( n \, \hat{\varepsilon } + n \, \hat{\varepsilon } \, || Q^H ||_F + || Q^H ||_F^2 \, \delta _\mathbf{u}^{(i-1)}\Bigr ), \end{aligned}$$

where \(c_1\) is a constant that depends only on \(\omega \) and d. Noting that \(\delta _\mathbf{u}^{(0)} = 0\), we finally obtain

$$\begin{aligned} \delta _\mathbf{u}^{(K)} \le c_2 \, n \, \hat{\varepsilon } \, \Bigl ( 1+ || Q^H ||_F + || Q^H ||_F^2 + \dotsc + || Q^H ||_F^K \Bigr ), \end{aligned}$$

where \(c_2\) is a constant that depends only on \(c_1\). Finally, by the sum of a finite geometric series, we get

$$\begin{aligned} \delta _\mathbf{u}^{(K)} = || \ln ( \mathbf{u}^{(K)} \oslash \mathbf{u}^{H (K)} ) ||_2 \le c_3 \, n \, \hat{\varepsilon } \, || Q^H ||_F^K, \end{aligned}$$
(6.9)

where \(c_3\) is a constant that depends only on \(c_2\), that is, on \(\omega \) and d. The desired estimate (4.6) follows by (6.6) and (6.9), and noting that a similar estimate as (6.9) holds for \(\delta _\mathbf{v}^{(K)} := || \ln ( \mathbf{v}^{(K)} \oslash \mathbf{v}^{H (K)}) ||_2\). \(\square \)


About this article


Cite this article

Motamed, M.: A hierarchically low-rank optimal transport dissimilarity measure for structured data. BIT Numer. Math. 62, 1945–1982 (2022). https://doi.org/10.1007/s10543-022-00937-9

