# Better and simpler error analysis of the Sinkhorn–Knopp algorithm for matrix scaling

## Abstract

Given a non-negative $$n \times m$$ real matrix $$A$$, the matrix scaling problem is to determine whether it is possible to scale the rows and columns so that each row and each column sums to a specified positive target value. The Sinkhorn–Knopp algorithm is a simple and classic procedure that alternately scales all rows and all columns to meet these targets. The focus of this paper is the worst-case theoretical analysis of this algorithm. We present an elementary convergence analysis for this algorithm that improves upon the previous best bound. In a nutshell, our approach is to show (i) a simple bound on the number of iterations needed so that the KL-divergence between the current row-sums and the target row-sums drops below a specified threshold $$\delta$$, and (ii) that for a suitable choice of $$\delta$$, whenever the KL-divergence is below $$\delta$$, the $$\ell _1$$-error or the $$\ell _2$$-error is below $$\varepsilon$$. The well-known Pinsker’s inequality immediately allows us to translate a bound on the KL-divergence into a bound on the $$\ell _1$$-error. To bound the $$\ell _2$$-error in terms of the KL-divergence, we establish a new inequality, referred to as (KL vs $$\ell _1/\ell _2$$). This inequality strengthens Pinsker’s inequality and may be of independent interest.
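To make the procedure under analysis concrete, the following is a minimal sketch of Sinkhorn–Knopp-style alternate scaling in pure Python. The function name, the list-of-lists representation, and the $$\ell _1$$ row-sum stopping rule are illustrative choices of ours, not the paper's pseudocode; the paper's analysis tracks the KL-divergence rather than the $$\ell _1$$-error directly.

```python
def sinkhorn_knopp(A, r, c, eps=1e-9, max_iters=100000):
    """Alternately scale rows and columns of a nonnegative matrix A
    so that row sums approach r and column sums approach c.
    Illustrative sketch; stopping rule and parameter names are ours."""
    n, m = len(A), len(A[0])
    B = [row[:] for row in A]
    for _ in range(max_iters):
        # Scale each row to hit its target row sum exactly.
        for i in range(n):
            s = sum(B[i])
            B[i] = [x * r[i] / s for x in B[i]]
        # Scale each column to hit its target column sum exactly.
        for j in range(m):
            s = sum(B[i][j] for i in range(n))
            for i in range(n):
                B[i][j] *= c[j] / s
        # After column scaling, row sums drift; stop once the
        # l1 deviation of row sums from the targets is below eps.
        err = sum(abs(sum(B[i]) - r[i]) for i in range(n))
        if err < eps:
            break
    return B
```

On the $$2 \times 2$$ example from the footnotes (with unit targets), this iteration converges to the doubly stochastic matrix $$\begin{bmatrix} 2-\sqrt{2} & \sqrt{2}-1 \\ \sqrt{2}-1 & 2-\sqrt{2} \end{bmatrix}$$, illustrating that the limit can have irrational entries even for integer input.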


1.

Computationally, this asymptotic viewpoint is unavoidable in the sense that there are simple examples for which the unique scaling matrices must have irrational entries. For instance, consider the following example from Rothblum and Schneider [32]. The matrix is $$\begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix}$$ with $${\mathbf {r}}\equiv {\mathbf {c}}\equiv [1,1]^\top$$. The unique (up to scaling) matrices R and S are $$\begin{bmatrix} (\sqrt{2}+1)^{-1} & 0 \\ 0 & (\sqrt{2}+2)^{-1} \end{bmatrix}$$ and $$\begin{bmatrix} \sqrt{2} & 0 \\ 0 & 1 \end{bmatrix}$$, respectively, giving $$RAS = \begin{bmatrix} 2 - \sqrt{2} & \sqrt{2} - 1 \\ \sqrt{2} - 1 & 2 - \sqrt{2} \end{bmatrix}$$.
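This closed form is easy to check numerically; the variable names in the sketch below are ours:

```python
import math

# Verify the footnote's closed-form scaling: with R and S as given
# (stored here as their diagonal entries), RAS matches the stated
# matrix and is doubly stochastic (all row and column sums equal 1).
s2 = math.sqrt(2)
A = [[1.0, 1.0], [1.0, 2.0]]
R = [1.0 / (s2 + 1.0), 1.0 / (s2 + 2.0)]   # diagonal of R
S = [s2, 1.0]                               # diagonal of S
RAS = [[R[i] * A[i][j] * S[j] for j in range(2)] for i in range(2)]
row_sums = [sum(row) for row in RAS]
col_sums = [RAS[0][j] + RAS[1][j] for j in range(2)]
```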

2.

After the first version of this paper was made public, we were pointed to concurrent work by Altschuler, Weed, and Rigollet [4] studying the $$\ell _1$$-error and obtaining the same result as part 1 of Theorem 1.

3.

The cited authors never make the base of the logarithm explicit, but their proof shows it can be as large as $$1 - 1/\nu ^2$$.

4.

The KL-divergence is normally defined between two probability distributions and does not include the $$1/h$$ factor. Also, the logarithms are usually taken base 2.

## References

1. Aaronson, S.: Quantum computing and hidden variables. Phys. Rev. A 71, 032325 (2005)

2. Allen-Zhu, Z., Li, Y., Oliveira, R., Wigderson, A.: Much faster algorithms for matrix scaling. In: 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS (2017)

3. Allen-Zhu, Z., Orecchia, L.: Using optimization to break the epsilon barrier: a faster and simpler width-independent algorithm for solving positive linear programs in parallel. In: Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pp. 1439–1456, San Diego, CA, USA (2015)

4. Altschuler, J., Weed, J., Rigollet, P.: Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In: Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4–9 December 2017, pp. 1961–1971, Long Beach, CA, USA (2017)

5. Bacharach, M.: Estimating nonnegative matrices from marginal data. Int. Econ. Rev. 6(3), 294–310 (1965)

6. Balakrishnan, H., Hwang, I., Tomlin, C.J.: Polynomial approximation algorithms for belief matrix maintenance in identity management. In: 2004 43rd IEEE Conference on Decision and Control (CDC) (IEEE Cat. No.04CH37601), vol. 5, pp. 4874–4879 (2004)

7. Bapat, R., Raghavan, T.: An extension of a theorem of Darroch and Ratcliff in loglinear models and its application to scaling multidimensional matrices. Linear Algebra Appl. 114, 705–715 (1989). Special Issue Dedicated to Alan J. Hoffman

8. Bregman, L.: The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 7(3), 200–217 (1967)

9. Brualdi, R.A., Parter, S.V., Schneider, H.: The diagonal equivalence of a nonnegative matrix to a stochastic matrix. J. Math. Anal. Appl. 16(1), 31–50 (1966)

10. Cohen, M.B., Madry, A., Tsipras, D., Vladu, A.: Matrix scaling and balancing via box-constrained Newton's method and interior point methods. In: 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS (2017)

11. Csiszár, I.: I-divergence geometry of probability distributions and minimization problems. Ann. Probab. 3(1), 146–158 (1975)

12. Csiszár, I.: A geometric interpretation of Darroch and Ratcliff's generalized iterative scaling. Ann. Stat. 17(3), 1409–1413 (1989)

13. Deming, W.E., Stephan, F.F.: On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. Ann. Math. Stat. 11(4), 427–444 (1940)

14. Franklin, J., Lorenz, J.: On the scaling of multidimensional matrices. Linear Algebra Appl. 114, 717–735 (1989). Special Issue Dedicated to Alan J. Hoffman

15. Gurvits, L., Yianilos, P.N.: The deflation–inflation method for certain semidefinite programming and maximum determinant completion problems. Technical report, NEC Research Institute, 4 Independence Way, Princeton, NJ 08540 (1998)

16. Idel, M.: A review of matrix scaling and Sinkhorn's normal form for matrices and positive maps (2016). ArXiv e-prints. arXiv:1609.06349

17. Kalantari, B., Khachiyan, L.: On the rate of convergence of deterministic and randomized RAS matrix scaling algorithms. Oper. Res. Lett. 14(5), 237–244 (1993)

18. Kalantari, B., Khachiyan, L.: On the complexity of nonnegative-matrix scaling. Linear Algebra Appl. 240, 87–103 (1996)

19. Kalantari, B., Lari, I., Ricca, F., Simeone, B.: On the complexity of general matrix scaling and entropy minimization via the RAS algorithm. Technical Report, no. 24. Department of Statistics and Applied Probability, La Sapienza University, Rome (2002)

20. Kalantari, B., Lari, I., Ricca, F., Simeone, B.: On the complexity of general matrix scaling and entropy minimization via the RAS algorithm. Math. Program. 112(2), 371–401 (2008)

21. Knight, P.A.: The Sinkhorn–Knopp algorithm: convergence and applications. SIAM J. Matrix Anal. Appl. 30(1), 261–275 (2008)

22. Linial, N., Samorodnitsky, A., Wigderson, A.: A deterministic strongly polynomial algorithm for matrix scaling and approximate permanents. Combinatorica 20(4), 545–568 (2000)

23. Luby, M., Nisan, N.: A parallel approximation algorithm for positive linear programming. In: Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing, STOC '93, pp. 448–457. ACM, New York, NY, USA (1993)

24. Macgill, S.M.: Theoretical properties of biproportional matrix adjustments. Environ. Plan. A 9(6), 687–701 (1977)

25. Mahoney, M.W., Rao, S., Wang, D., Zhang, P.: Approximating the solution to mixed packing and covering LPs in parallel $$O(\epsilon ^{-3})$$ time. In: 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, July 11–15, 2016, pp. 52:1–52:14, Rome, Italy (2016)

26. Menon, M.: Reduction of a matrix with positive elements to a doubly stochastic matrix. Proc. Am. Math. Soc. 18(2), 244–247 (1967)

27. Ortúzar, J. de D., Willumsen, L.G.: Modelling Transport. Wiley, New York (2011)

28. Pollard, D.: A User's Guide to Measure Theoretic Probability. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge (2001)

29. Raghavan, T.: On pairs of multidimensional matrices. Linear Algebra Appl. 62, 263–268 (1984)

30. Reiss, R.-D.: Approximate Distributions of Order Statistics. Springer, Berlin (1989)

31. Rote, G., Zachariasen, M.: Matrix scaling by network flow. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '07, pp. 848–854. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA (2007)

32. Rothblum, U., Schneider, H.: Scalings of matrices which have prespecified row sums and column sums via optimization. Linear Algebra Appl. 114, 737–764 (1989). Special Issue Dedicated to Alan J. Hoffman

33. Rüschendorf, L.: Convergence of the iterative proportional fitting procedure. Ann. Stat. 23(4), 1160–1174 (1995)

34. Sason, I., Verdú, S.: Upper bounds on the relative entropy and Rényi divergence as a function of total variation distance for finite alphabets. In: 2015 IEEE Information Theory Workshop—Fall (ITW), Jeju Island, South Korea, October 11–15, 2015, pp. 214–218. IEEE (2015)

35. Schrödinger, E.: Über die Umkehrung der Naturgesetze. Preuss. Akad. Wiss., Phys.-Math. Kl., pp. 412–422 (1931)

36. Sinkhorn, R.: Diagonal equivalence to matrices with prescribed row and column sums. Am. Math. Mon. 74(4), 402–405 (1967)

37. Sinkhorn, R., Knopp, P.: Concerning nonnegative matrices and doubly stochastic matrices. Pac. J. Math. 21(2), 343–348 (1967)

38. Soules, G.W.: The rate of convergence of Sinkhorn balancing. Linear Algebra Appl. 150, 3–40 (1991)

39. Xi'an: Statistics Stack Exchange. https://stats.stackexchange.com/questions/130432/differences-between-bhattacharyya-distance-and-kl-divergence (2014). Accessed 15 Jan 2018

40. Young, N.E.: Sequential and parallel algorithms for mixed packing and covering. In: 42nd Annual Symposium on Foundations of Computer Science, FOCS, pp. 538–546, Las Vegas, Nevada, USA (2001)

## Acknowledgements

We thank Daniel Dadush for asking about the connection between our inequality and the Hellinger distance, and Jonathan Weed for letting us know of [4]. We would also like to thank the anonymous reviewers for their insightful suggestions.

## Author information


### Corresponding author

Correspondence to Deeparnab Chakrabarty.