Better and simpler error analysis of the Sinkhorn–Knopp algorithm for matrix scaling

Abstract

Given a non-negative \(n \times m\) real matrix A, the matrix scaling problem is to determine whether the rows and columns can be scaled so that each row and each column sums to a specified positive target value. The Sinkhorn–Knopp algorithm is a simple and classic procedure that alternately scales all rows and all columns to meet these targets. The focus of this paper is the worst-case theoretical analysis of this algorithm. We present an elementary convergence analysis for this algorithm that improves upon the previous best bound. In a nutshell, our approach is to (i) show a simple bound on the number of iterations needed for the KL-divergence between the current row-sums and the target row-sums to drop below a specified threshold \(\delta \), and (ii) show that, for a suitable choice of \(\delta \), whenever the KL-divergence is below \(\delta \), the \(\ell _1\)-error or the \(\ell _2\)-error is below \(\varepsilon \). The well-known Pinsker’s inequality immediately allows us to translate a bound on the KL-divergence into a bound on the \(\ell _1\)-error. To bound the \(\ell _2\)-error in terms of the KL-divergence, we establish a new inequality, referred to as (KL vs \(\ell _1/\ell _2\)). This inequality is a strengthening of Pinsker’s inequality and may be of independent interest.
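To make the procedure in the abstract concrete, the following is a minimal sketch of alternate row and column scaling with a KL-based stopping rule. It is an illustration, not the paper's pseudocode: the function name `sinkhorn_knopp`, the parameters `delta` and `max_iters`, the order of the two scaling steps, and the divergence formula \(\frac{1}{h}\sum _i r_i \ln (r_i / r_i^{\text {cur}})\) with \(h = \sum _i r_i\) are all assumptions. The sketch also assumes a strictly positive matrix, so no row or column sum is ever zero.

```python
import math

def sinkhorn_knopp(A, r, c, delta=1e-9, max_iters=10_000):
    """Alternately scale columns and rows of A (list of lists, positive entries).

    Stops once the (assumed) scaled KL-divergence between the target row sums r
    and the current row sums is at most delta. Returns the scaled matrix, the
    number of column-scaling steps performed, the final KL value, and the final
    l1 row error.
    """
    n, m = len(A), len(A[0])
    h = sum(r)  # assumed to equal sum(c)
    B = [row[:] for row in A]
    kl = l1 = float("inf")
    for step in range(1, max_iters + 1):
        # Column scaling: make every column sum exactly c[j].
        col = [sum(B[i][j] for i in range(n)) for j in range(m)]
        for i in range(n):
            for j in range(m):
                B[i][j] *= c[j] / col[j]
        # Measure how far the row sums are from their targets.
        row = [sum(B[i]) for i in range(n)]
        kl = sum(r[i] * math.log(r[i] / row[i]) for i in range(n)) / h
        l1 = sum(abs(row[i] - r[i]) for i in range(n))
        if kl <= delta:
            return B, step, kl, l1
        # Row scaling: make every row sum exactly r[i].
        for i in range(n):
            s = r[i] / row[i]
            B[i] = [s * x for x in B[i]]
    return B, max_iters, kl, l1

# Example: the 2 x 2 matrix from Note 1 with unit row and column targets.
A = [[1.0, 1.0], [1.0, 2.0]]
B, steps, kl, l1 = sinkhorn_knopp(A, [1.0, 1.0], [1.0, 1.0], delta=1e-12)
print(steps, kl, l1)
```

After each column-scaling step the current row sums add up to \(h\), so both row-sum vectors become probability distributions once divided by \(h\). With natural logarithms, Pinsker's inequality then gives \(\ell _1\text {-error} \le h\sqrt{2\,\mathrm {KL}}\), which is the kind of translation from a KL bound to an \(\ell _1\) bound that the abstract describes.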

Notes

  1. Computationally, this asymptotic viewpoint is unavoidable in the sense that there are simple examples for which the unique scaling matrices must have irrational entries. For instance, consider the following example from Rothblum and Schneider [32]. The matrix is \(\begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix}\) with \({\mathbf {r}}\equiv {\mathbf {c}}\equiv [1,1]^\top \). The unique (up to scaling) R and S matrices are \(\begin{bmatrix} (\sqrt{2}+1)^{-1} & 0 \\ 0 & (\sqrt{2}+2)^{-1} \end{bmatrix}\) and \(\begin{bmatrix} \sqrt{2} & 0 \\ 0 & 1 \end{bmatrix}\), respectively, giving \(RAS = \begin{bmatrix} 2 - \sqrt{2} & \sqrt{2} - 1 \\ \sqrt{2} - 1 & 2 - \sqrt{2} \end{bmatrix}\).

  2. After the first version of this paper was made public, we were pointed to concurrent work by Altschuler, Weed and Rigollet [4] studying the \(\ell _1\)-error and obtaining the same result as part 1 of Theorem 1.

  3. [14] never make the base of the logarithm explicit, but their proof shows it can be as large as \(1 - 1/\nu ^2\).

  4. The KL-divergence is normally stated between two distributions and doesn’t have the 1/h factor. Also, the logarithms are usually base 2.
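The example in Note 1 and the normalisation remark in Note 4 can both be checked numerically. The sketch below multiplies out \(RAS\) for the matrices given in Note 1 and confirms that all row and column sums equal 1. It then compares a scaled divergence \(\frac{1}{h}\sum _i p_i \ln (p_i/q_i)\) with the standard KL-divergence of the normalised vectors \(p/h\) and \(q/h\). The helper names `scale`, `scaled_kl` and `standard_kl`, the test vectors `p` and `q`, and the exact form of the scaled divergence (natural logarithm, \(h = \sum _i p_i\)) are assumptions made for illustration; this page does not state the paper's definition.

```python
import math

s2 = math.sqrt(2.0)
A = [[1.0, 1.0], [1.0, 2.0]]
R = [1.0 / (s2 + 1.0), 1.0 / (s2 + 2.0)]  # diagonal of R from Note 1
S = [s2, 1.0]                             # diagonal of S from Note 1

def scale(A, R, S):
    """Return R A S for diagonal scalings given as lists of diagonal entries."""
    return [[R[i] * A[i][j] * S[j] for j in range(len(S))] for i in range(len(R))]

RAS = scale(A, R, S)
row_sums = [sum(row) for row in RAS]
col_sums = [sum(RAS[i][j] for i in range(2)) for j in range(2)]
print(RAS, row_sums, col_sums)

def scaled_kl(p, q):
    """Assumed form: (1/h) * sum_i p_i ln(p_i / q_i), with h = sum(p)."""
    h = sum(p)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q)) / h

def standard_kl(p, q, base=math.e):
    """Standard KL-divergence between two probability distributions."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q))

p = [1.0, 1.0]              # hypothetical target row sums, h = 2
q = [5.0 / 6.0, 7.0 / 6.0]  # hypothetical current row sums, also summing to 2
h = sum(p)
nats = standard_kl([x / h for x in p], [x / h for x in q])
bits = standard_kl([x / h for x in p], [x / h for x in q], base=2)
print(scaled_kl(p, q), nats, bits)
```

When both vectors sum to \(h\), the factor \(1/h\) is exactly what turns the sum over unnormalised entries into the standard divergence between the two normalised distributions. Switching from natural logarithms to base 2 only rescales the value by \(1/\ln 2\).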

References

  1. Aaronson, S.: Quantum computing and hidden variables. Phys. Rev. A 71, 032325 (2005)
  2. Allen Zhu, Z., Li, Y., Oliveira, R., Wigderson, A.: Much faster algorithms for matrix scaling. In: 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS (2017)
  3. Allen Zhu, Z., Orecchia, L.: Using optimization to break the epsilon barrier: a faster and simpler width-independent algorithm for solving positive linear programs in parallel. In: Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pp. 1439–1456, San Diego, CA, USA (2015)
  4. Altschuler, J., Weed, J., Rigollet, P.: Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In: Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4–9 December 2017, pp. 1961–1971, Long Beach, CA, USA (2017)
  5. Bacharach, M.: Estimating nonnegative matrices from marginal data. Int. Econ. Rev. 6(3), 294–310 (1965)
  6. Balakrishnan, H., Hwang, I., Tomlin, C.J.: Polynomial approximation algorithms for belief matrix maintenance in identity management. In: 2004 43rd IEEE Conference on Decision and Control (CDC) (IEEE Cat. No.04CH37601), vol. 5, pp. 4874–4879 (2004)
  7. Bapat, R., Raghavan, T.: An extension of a theorem of Darroch and Ratcliff in loglinear models and its application to scaling multidimensional matrices. Linear Algebra Appl. 114, 705–715 (1989). Special Issue Dedicated to Alan J. Hoffman
  8. Bregman, L.: The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 7(3), 200–217 (1967)
  9. Brualdi, R.A., Parter, S.V., Schneider, H.: The diagonal equivalence of a nonnegative matrix to a stochastic matrix. J. Math. Anal. Appl. 16(1), 31–50 (1966)
  10. Cohen, M.B., Madry, A., Tsipras, D., Vladu, A.: Matrix scaling and balancing via box constrained Newton’s method and interior point methods. In: 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS (2017)
  11. Csiszar, I.: I-divergence geometry of probability distributions and minimization problems. Ann. Probab. 3(1), 146–158 (1975)
  12. Csiszar, I.: A geometric interpretation of Darroch and Ratcliff’s generalized iterative scaling. Ann. Stat. 17(3), 1409–1413 (1989)
  13. Deming, W.E., Stephan, F.F.: On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. Ann. Math. Stat. 11(4), 427–444 (1940)
  14. Franklin, J., Lorenz, J.: On the scaling of multidimensional matrices. Linear Algebra Appl. 114, 717–735 (1989). Special Issue Dedicated to Alan J. Hoffman
  15. Gurvits, L., Yianilos, P.N.: The deflation–inflation method for certain semidefinite programming and maximum determinant completion problems. Technical report, NEC Research Institute, 4 Independence Way, Princeton, NJ 08540 (1998)
  16. Idel, M.: A review of matrix scaling and Sinkhorn’s normal form for matrices and positive maps (2016). ArXiv e-prints. arXiv:1609.06349
  17. Kalantari, B., Khachiyan, L.: On the rate of convergence of deterministic and randomized RAS matrix scaling algorithms. Oper. Res. Lett. 14(5), 237–244 (1993)
  18. Kalantari, B., Khachiyan, L.: On the complexity of nonnegative-matrix scaling. Linear Algebra Appl. 240, 87–103 (1996)
  19. Kalantari, B., Lari, I., Ricca, F., Simeone, B.: On the complexity of general matrix scaling and entropy minimization via the RAS algorithm. Technical Report, no. 24. Department of Statistics and Applied Probability, La Sapienza University, Rome (2002)
  20. Kalantari, B., Lari, I., Ricca, F., Simeone, B.: On the complexity of general matrix scaling and entropy minimization via the RAS algorithm. Math. Program. 112(2), 371–401 (2008)
  21. Knight, P.A.: The Sinkhorn–Knopp algorithm: convergence and applications. SIAM J. Matrix Anal. Appl. 30(1), 261–275 (2008)
  22. Linial, N., Samorodnitsky, A., Wigderson, A.: A deterministic strongly polynomial algorithm for matrix scaling and approximate permanents. Combinatorica 20(4), 545–568 (2000)
  23. Luby, M., Nisan, N.: A parallel approximation algorithm for positive linear programming. In: Proceedings of the Twenty-fifth Annual ACM Symposium on Theory of Computing, STOC ’93, pp. 448–457. ACM, New York, NY, USA (1993)
  24. Macgill, S.M.: Theoretical properties of biproportional matrix adjustments. Environ. Plan. A 9(6), 687–701 (1977)
  25. Mahoney, M.W., Rao, S., Wang, D., Zhang, P.: Approximating the solution to mixed packing and covering LPs in parallel \(O(\epsilon ^{-3})\) time. In: 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, July 11–15, 2016, pp. 52:1–52:14, Rome, Italy (2016)
  26. Menon, M.: Reduction of a matrix with positive elements to a doubly stochastic matrix. Proc. Am. Math. Soc. 18(2), 244–247 (1967)
  27. Ortúzar, J.d.D., Willumsen, L.G.: Modelling Transport. Wiley, New York (2011)
  28. Pollard, D.: A User’s Guide to Measure Theoretic Probability. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge (2001)
  29. Raghavan, T.: On pairs of multidimensional matrices. Linear Algebra Appl. 62, 263–268 (1984)
  30. Reiss, R.-D.: Approximate Distributions of Order Statistics. Springer, Berlin (1989)
  31. Rote, G., Zachariasen, M.: Matrix scaling by network flow. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’07, pp. 848–854. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA (2007)
  32. Rothblum, U., Schneider, H.: Scalings of matrices which have prespecified row sums and column sums via optimization. Linear Algebra Appl. 114, 737–764 (1989). Special Issue Dedicated to Alan J. Hoffman
  33. Ruschendorf, L.: Convergence of the iterative proportional fitting procedure. Ann. Stat. 23(4), 1160–1174 (1995)
  34. Sason, I., Verdú, S.: Upper bounds on the relative entropy and Rényi divergence as a function of total variation distance for finite alphabets. In: 2015 IEEE Information Theory Workshop—Fall (ITW), Jeju Island, South Korea, October 11–15, 2015, pp. 214–218. IEEE (2015)
  35. Schrödinger, E.: Über die Umkehrung der Naturgesetze. Preuss. Akad. Wiss., Phys.-Math. Kl, pp. 412–422 (1931)
  36. Sinkhorn, R.: Diagonal equivalence to matrices with prescribed row and column sums. Am. Math. Mon. 74(4), 402–405 (1967)
  37. Sinkhorn, R., Knopp, P.: Concerning nonnegative matrices and doubly stochastic matrices. Pac. J. Math. 21(2), 343–348 (1967)
  38. Soules, G.W.: The rate of convergence of Sinkhorn balancing. Linear Algebra Appl. 150, 3–40 (1991)
  39. Xi’an: Statistics stack exchange. https://stats.stackexchange.com/questions/130432/differences-between-bhattacharyya-distance-and-kl-divergence (2014). Accessed 15 Jan 2018
  40. Young, N.E.: Sequential and parallel algorithms for mixed packing and covering. In: 42nd Annual Symposium on Foundations of Computer Science, FOCS, pp. 538–546, Las Vegas, Nevada, USA (2001)

Acknowledgements

We thank Daniel Dadush for asking about the connection between our inequality and the Hellinger distance, and Jonathan Weed for letting us know of [4]. We would also like to thank the anonymous reviewers for insightful suggestions.

Author information

Corresponding author

Correspondence to Deeparnab Chakrabarty.

Additional information

The second author’s work was supported in part by the National Science Foundation Grants CCF-1552909 and CCF-1617851. A preliminary version of this work was presented at the 1st Symposium on Simplicity in Algorithms, January 10th, 2018, co-located with SODA 2018 in New Orleans.

About this article

Cite this article

Chakrabarty, D., Khanna, S. Better and simpler error analysis of the Sinkhorn–Knopp algorithm for matrix scaling. Math. Program. 188, 395–407 (2021). https://doi.org/10.1007/s10107-020-01503-3

Keywords

  • Matrix scaling
  • Alternate minimization
  • KL divergence
  • Matchings

Mathematics Subject Classification

  • 68Q25
  • 68W40