SKCompress: compressing sparse and nonuniform gradient in distributed machine learning


Abstract

Distributed machine learning (ML) has been extensively studied to meet the explosive growth of training data. A wide range of ML models are trained with a family of first-order optimization algorithms, i.e., stochastic gradient descent (SGD). The core operation of SGD is the calculation of gradients. When executing SGD in a distributed environment, the workers need to exchange their local gradients over the network. To reduce the communication cost, a category of quantification-based compression algorithms transforms the gradients into a binary format, at the expense of a small loss of precision. Although the existing approaches work fine for dense gradients, we find that they are ill-suited for many cases in which the gradients are sparse and nonuniformly distributed. In this paper, we ask: is there a compression framework that can efficiently handle sparse and nonuniform gradients? We propose a general compression framework, called SKCompress, to compress both the gradient values and the gradient keys of sparse gradients. Our first contribution is a sketch-based method that compresses the gradient values. A sketch is a class of algorithms that approximates the distribution of a data stream with a probabilistic data structure. We first use a quantile sketch to generate splits, sort the gradient values into buckets, and encode them with the bucket indexes. Our second contribution is a new sketch algorithm, namely MinMaxSketch, which compresses the bucket indexes. MinMaxSketch builds a set of hash tables and resolves hash collisions with a MinMax strategy. Since the bucket indexes are nonuniform, we further adopt Huffman coding to compress MinMaxSketch. To compress the keys of sparse gradients, the third contribution of this paper is a delta-binary encoding method that calculates the increments of the gradient keys and encodes them in a binary format. An adaptive prefix is proposed to assign different sizes to different gradient keys so that more space can be saved. We also theoretically discuss the correctness and the error bound of the proposed methods. To the best of our knowledge, this is the first effort to utilize data sketches to compress gradients in ML. We implement a prototype system in a real cluster of our industrial partner Tencent Inc. and show that our method is up to \(12\times \) faster than the existing methods.



Notes

  1. Gradient compression can bring a larger speedup for all-reduce systems than for parameter-server systems, owing to the single-bottleneck problem of the driver node. Parameter-server systems accelerate communication by using more machines and larger bandwidth to aggregate the gradients. Following the setting of previous works on gradient compression, our work compresses gradients in all-reduce systems without using more machines.

  2. The sparsity of the aggregated gradient remains unchanged since the total batch size of all workers is the same.

References

  1. Alistarh, D., Li, J., Tomioka, R., Vojnovic, M.: Qsgd: randomized quantization for communication-optimal stochastic gradient descent. arXiv preprint. arXiv:1610.02132 (2016)

  2. Armbrust, M., Fox, A., Griffith, R., Joseph, A.D., Katz, R., Konwinski, A., Lee, G., Patterson, D., Rabkin, A., Stoica, I., et al.: A view of cloud computing. Commun. ACM 53(4), 50–58 (2010)


  3. Bell, N., Garland, M.: Efficient sparse matrix-vector multiplication on cuda. Tech. rep., Nvidia Corporation (2008)

  4. Bonner, S., Kureshi, I., Brennan, J., Theodoropoulos, G., McGough, A.S., Obara, B.: Exploring the semantic content of unsupervised graph embeddings: an empirical study. Data Sci. Eng. 4(3), 269–289 (2019)


  5. Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010, pp. 177–186 (2010)

  6. Bottou, L.: Stochastic gradient descent tricks. In: Neural Networks: Tricks of the Trade, pp. 421–436 (2012)

  7. Bubeck, S., et al.: Convex optimization: algorithms and complexity. Found. Trends® Mach. Learn. 8(3–4), 231–357 (2015)


  8. Chen, T., Guestrin, C.: Xgboost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794 (2016)

  9. Cifar: Cifar dataset. https://www.cs.toronto.edu/~kriz/cifar.html

  10. Cormode, G., Muthukrishnan, S.: An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55(1), 58–75 (2005)


  11. Dean, J., Corrado, G., Monga, R., et al.: Large scale distributed deep networks. In: Advances in Neural Information Processing Systems, pp. 1223–1231 (2012)

  12. Deutsch, P.: Deflate compressed data format specification version 1.3. Tech. rep. (1996)

  13. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12(Jul), 2121–2159 (2011)


  14. Greenwald, M., Khanna, S.: Space-efficient online computation of quantile summaries. ACM SIGMOD Record 30, 58–66 (2001)


  15. Gubbi, J., Buyya, R., Marusic, S., Palaniswami, M.: Internet of things (iot): a vision, architectural elements, and future directions. Future Gener. Comput. Syst. 29(7), 1645–1660 (2013)


  16. Hinds, S.C., Fisher, J.L., D’Amato, D.P.: A document skew detection method using run-length encoding and the hough transform. In: Pattern Recognition, 1990. Proceedings., 10th International Conference on, Vol. 1, pp. 464–468. IEEE (1990)

  17. Ho, Q., Cipar, J., Cui, H., Lee, S., Kim, J.K., Gibbons, P.B., Gibson, G.A., Ganger, G., Xing, E.P.: More effective distributed ml via a stale synchronous parallel parameter server. In: Advances in Neural Information Processing Systems, pp. 1223–1231 (2013)

  18. Hosmer Jr., D.W., Lemeshow, S., Sturdivant, R.X.: Applied Logistic Regression, vol. 398. Wiley, New York (2013)


  19. Huang, Y., Cui, B., Zhang, W., Jiang, J., Xu, Y.: Tencentrec: real-time stream recommendation in practice. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 227–238. ACM (2015)

  20. Jiang, J., Cui, B., Zhang, C., Yu, L.: Heterogeneity-aware distributed parameter servers. In: Proceedings of the 2017 ACM International Conference on Management of Data, pp. 463–478. ACM (2017)

  21. Jiang, J., Huang, M., Jiang, J., Cui, B.: Teslaml: steering machine learning automatically in tencent. In: Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint Conference on Web and Big Data, pp. 313–318. Springer (2017)

  22. Jiang, J., Jiang, J., Cui, B., Zhang, C.: Tencentboost: a gradient boosting tree system with parameter server. In: 2017 IEEE 33rd International Conference on Data Engineering (ICDE), pp. 281–284 (2017)

  23. Jiang, J., Tong, Y., Lu, H., Cui, B., Lei, K., Yu, L.: Gvos: a general system for near-duplicate video-related applications on storm. ACM Trans. Inf. Syst. (TOIS) 36(1), 3 (2017)


  24. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013)

  25. KDD: Kdd cup 2010 (2010). http://www.kdd.org/kdd-cup/

  26. KDD: Kdd cup 2012 (2012). https://www.kaggle.com/c/kddcup2012-track1

  27. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint. arXiv:1412.6980 (2014)

  28. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint. arXiv:1609.02907 (2016)

  29. Knuth, D.E.: Dynamic Huffman coding. J. Algorithms 6(2), 163–180 (1985)


  30. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances In Neural Information Processing Systems, pp. 1097–1105 (2012)

  31. Li, B., Drozd, A., Guo, Y., Liu, T., Matsuoka, S., Du, X.: Scaling word2vec on big corpus. Data Sci. Eng. 1–19 (2019)

  32. Li, M., Liu, Z., Smola, A.J., Wang, Y.X.: Difacto: distributed factorization machines. In: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pp. 377–386. ACM (2016)

  33. Li, M., Zhang, T., Chen, Y., Smola, A.J.: Efficient mini-batch training for stochastic optimization. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 661–670. ACM (2014)

  34. Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning. arXiv preprint. arXiv:1509.02971 (2015)

  35. Lin, Y., Han, S., Mao, H., Wang, Y., Dally, W.J.: Deep gradient compression: reducing the communication bandwidth for distributed training. arXiv preprint. arXiv:1712.01887 (2017)

  36. McMahan, B., Streeter, M.: Delay-tolerant algorithms for asynchronous distributed online learning. In: Advances in Neural Information Processing Systems, pp. 2915–2923 (2014)

  37. Needell, D., Ward, R., Srebro, N.: Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm. In: Advances in Neural Information Processing Systems, pp. 1017–1025 (2014)

  38. Neelakantan, A., Vilnis, L., Le, Q.V., Sutskever, I., Kaiser, L., Kurach, K., Martens, J.: Adding gradient noise improves learning for very deep networks. arXiv preprint. arXiv:1511.06807 (2015)

  39. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)


  40. Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence \(O(1/k^2)\). Doklady AN USSR 269, 543–547 (1983)


  41. Parnell, T., Dünner, C., Atasu, K., Sifalakis, M., Pozidis, H.: Large-scale stochastic learning using gpus. In: 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 419–428 (2017)

  42. Parnell, T., Dünner, C., Atasu, K., Sifalakis, M., Pozidis, H.: Tera-scale coordinate descent on gpus. Future Gener. Comput. Syst. (2018)

  43. Qian, N.: On the momentum term in gradient descent learning algorithms. Neural Netw. 12(1), 145–151 (1999)


  44. Reddi, S.J., Kale, S., Kumar, S.: On the convergence of Adam and beyond. arXiv preprint. arXiv:1904.09237 (2019)

  45. Rendle, S., Fetterly, D., Shekita, E.J., Su, B.y.: Robust large-scale machine learning in the cloud. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1125–1134 (2016)

  46. Seber, G.A., Lee, A.J.: Linear Regression Analysis, vol. 936. Wiley, Hoboken (2012)


  47. Seide, F., Fu, H., Droppo, J., Li, G., Yu, D.: 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns. In: INTERSPEECH, pp. 1058–1062 (2014)

  48. Stich, S.U., Cordonnier, J.B., Jaggi, M.: Sparsified sgd with memory. In: Advances in Neural Information Processing Systems, pp. 4447–4458 (2018)

  49. Suykens, J.A., Vandewalle, J.: Least squares support vector machine classifiers. Neural Process. Lett. 9(3), 293–300 (1999)


  50. Tang, H., Gan, S., Zhang, C., Zhang, T., Liu, J.: Communication compression for decentralized training. In: Advances in Neural Information Processing Systems, pp. 7652–7662 (2018)

  51. Tewarson, R.P.: Sparse Matrices. Academic Press, New York (1973)


  52. Wang, H., Sievert, S., Liu, S., Charles, Z., Papailiopoulos, D., Wright, S.: Atomo: communication-efficient learning via atomic sparsification. In: Advances in Neural Information Processing Systems, pp. 9850–9861 (2018)

  53. Wang, Y., Lin, P., Hong, Y.: Distributed regression estimation with incomplete data in multi-agent networks. Sci. China Inf. Sci. 61(9), 092202 (2018)


  54. Wen, W., Xu, C., Yan, F., Wu, C., Wang, Y., Chen, Y., Li, H.: Terngrad: ternary gradients to reduce communication in distributed deep learning. In: Advances in Neural Information Processing Systems, pp. 1509–1519 (2017)

  55. Wu, J., Huang, W., Huang, J., Zhang, T.: Error compensated quantized sgd and its applications to large-scale distributed optimization. arXiv preprint. arXiv:1806.08054 (2018)

  56. Yahoo: Data sketches (2004). https://datasketches.github.io/

  57. Yang, T., Jiang, J., Liu, P., Huang, Q., Gong, J., Zhou, Y., Miao, R., Li, X., Uhlig, S.: Elastic sketch: adaptive and fast network-wide measurements. In: ACM SIGCOMM, pp. 561–575 (2018)

  58. Yang, T., Liu, A.X., Shahzad, M., Zhong, Y., Fu, Q., Li, Z., Xie, G., Li, X.: A shifting bloom filter framework for set queries. Proc. VLDB Endow. 9(5), 408–419 (2016)


  59. Yu, L., Zhang, C., Shao, Y., Cui, B.: Lda*: a robust and large-scale topic modeling system. Proc. VLDB Endow. 10(11), 1406–1417 (2017)


  60. Zeiler, M.D.: Adadelta: an adaptive learning rate method. arXiv preprint. arXiv:1212.5701 (2012)

  61. Zhang, C., Ré, C.: Dimmwitted: a study of main-memory statistical analytics. Proc. VLDB Endow. 7(12), 1283–1294 (2014)


  62. Zhang, H., Kara, K., Li, J., Alistarh, D., Liu, J., Zhang, C.: Zipml: an end-to-end bitwise framework for dense generalized linear models. arXiv:1611.05402 (2016)

  63. Zhang, Q., Wang, W.: A fast algorithm for approximate quantiles in high speed data streams. In: Scientific and Statistical Database Management, 2007. 19th International Conference on SSDBM’07, pp. 29–29. IEEE (2007)

  64. Zheng, T., Chen, G., Wang, X., Chen, C., Wang, X., Luo, S.: Real-time intelligent big data processing: technology, platform, and applications. Sci. China Inf. Sci. 62(8), 82101 (2019)


  65. Zheng, Y., Zhang, L., Xie, X., Ma, W.Y.: Mining interesting locations and travel sequences from gps trajectories. In: Proceedings of the 18th International Conference on World Wide Web, pp. 791–800. ACM (2009)

  66. Zinkevich, M., Weimer, M., Li, L., Smola, A.J.: Parallelized stochastic gradient descent. In: Advances in Neural Information Processing Systems, pp. 2595–2603 (2010)

  67. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23(3), 337–343 (1977)


Acknowledgements

This work is supported by NSFC (No. 61832001, 61702016, 61702015, 61572039, U1936104), the National Key Research and Development Program of China (No. 2018YFB1004403), Beijing Academy of Artificial Intelligence (BAAI), the PKU-Tencent joint research lab, and the PCL Future project "Regional Network Facilities for Large-scale Experiments and Applications" under Grant PCL2018KP001.

Author information

Corresponding author

Correspondence to Tong Yang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Mathematical analysis of SKCompress

In this section, we theoretically analyze the correctness and the error bound of the three components of SKCompress.

1.1 Quantile-bucket quantification

1.1.1 Variance of stochastic gradients

A series of existing works has indicated that stochastic gradient descent (SGD) suffers from a slower convergence rate than gradient descent (GD) due to the inherent variance [39]. To be precise, we refer to Theorem 1.

Theorem 1

(Theorem 6.3 of [7]) Let f be convex and \(\theta ^*\) the optimal point. Choosing the step length appropriately, the convergence rate of SGD is

$$\begin{aligned} {\mathbb {E}}\left[ f\left( \frac{1}{T} \sum _{t=1}^{T} \theta _{t+1}\right) - f(\theta ^*) \right] \leqslant \varTheta \left( \frac{1}{T}+\frac{\sigma }{\sqrt{T}}\right) , \end{aligned}$$

where \(\sigma \) is the upper bound of mean variance

$$\begin{aligned} \sigma ^2 \geqslant \frac{1}{T} \sum _{t=1}^{T} {\mathbb {E}}||g_t - \nabla {f(\theta _t)}||^2. \end{aligned}$$

A key property of a stochastic gradient is its variance. Many methods are applied to reduce the variance, such as mini-batch [33], weighted sampling [37], and SVRG [24].

We denote by \({\tilde{g}}=\left\{ {\tilde{g}}_i\right\} _{i=1}^d\) the quantified gradient. Here, we abuse notation: in Theorem 1 the subscript of \(g_t\) indicates the tth epoch, whereas in the following analysis the subscript of \(g_i\) indicates the ith nonzero value of the gradient. The variance of \({\tilde{g}}\) can be decomposed as

$$\begin{aligned} {\mathbb {E}}||{\tilde{g}}-\nabla {f(\theta )}||^2 \leqslant {\mathbb {E}}||{\tilde{g}}-g||^2 + {\mathbb {E}}||g-\nabla {f(\theta )}||^2. \end{aligned}$$

The second term comes from the stochastic gradient and can be reduced by the methods mentioned above. Our goal is to find a quantification method that makes the first term as small as possible.

1.1.2 Variance bound of quantile-bucket quantification

In our framework, we use the quantile-bucket quantification method. For the sake of simplicity, we regard the maximum value in the gradient vector as the \((q+1)\)st quantile. The value range of the gradients, denoted by \(\left[ \phi _{min}, \phi _{max}\right] \), is split into q intervals by \(q+1\) quantiles \(v=\left\{ v_j\right\} _{j=1}^{q+1}\). Since we separate positive and negative values and create one quantile sketch for each of them, we assume there is always a quantile split equal to 0. Specifically, \(\phi _{min}=v_1<\cdots<v_{b_{zero}}=0<\cdots <v_{q+1}=\phi _{max}\). We also assume \(\left[ \phi _{min}, \phi _{max}\right] \subset \left[ -1, 1\right] \); otherwise, we can use \(M(g)=||g||\) as a scaling factor.
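Before stating the bound, the following minimal sketch (ours, not the paper's implementation) illustrates quantile-bucket quantification in NumPy. It replaces the streaming quantile sketch with exact quantiles via np.quantile, and names such as quantile_bucket_encode are illustrative only; the goal is simply to make the split/bucket/midpoint notation above concrete.

```python
import numpy as np

def quantile_splits(values, q):
    """Exact stand-in for the quantile sketch: q+1 split points over `values`."""
    return np.quantile(values, np.linspace(0.0, 1.0, q + 1))

def quantile_bucket_encode(g, q=256):
    """Encode gradient values by bucket index; return (indexes, splits).

    Positive and negative values get separate splits so that one split is 0,
    mirroring the separate sketches described above.
    """
    neg, pos = g[g < 0], g[g >= 0]
    splits = np.concatenate([
        quantile_splits(neg, q // 2)[:-1] if neg.size else np.array([]),
        [0.0],
        quantile_splits(pos, q // 2)[1:] if pos.size else np.array([]),
    ])
    # bucket index b(i) such that splits[b] <= g_i < splits[b+1]
    idx = np.clip(np.searchsorted(splits, g, side="right") - 1, 0, len(splits) - 2)
    return idx.astype(np.int32), splits

def quantile_bucket_decode(idx, splits):
    """Reconstruct each value as the midpoint of its bucket."""
    return 0.5 * (splits[idx] + splits[idx + 1])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    g = rng.standard_normal(100_000) * rng.exponential(0.01, 100_000)  # nonuniform values
    idx, splits = quantile_bucket_encode(g, q=256)
    g_hat = quantile_bucket_decode(idx, splits)
    print("mean squared quantification error:", np.mean((g_hat - g) ** 2))
```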

Theorem 2

The variance \({\mathbb {E}}||{\tilde{g}}-g||^2\) introduced by quantile-bucket quantification is bounded by

$$\begin{aligned} \frac{d}{4q}(\phi _{min}^2+\phi _{max}^2), \end{aligned}$$

where \(\phi _{min}\) and \(\phi _{max}\) are the minimum and maximum values in the gradient vector: \(\phi _{min}=\min \left\{ g_i\right\} , \phi _{max}=\max \left\{ g_i\right\} \).

Proof

Using the quantiles as split values, the expected number of values that fall into the same interval is \(\frac{d}{q}\), and for each \(g_i\),

$$\begin{aligned} \begin{aligned} ({\tilde{g}}_i - g_i)^2&= \left( \frac{1}{2}(v_{b(i)} + v_{b(i)+1}) - g_i\right) ^2\\&\leqslant \frac{1}{4} (v_{b(i)+1} - v_{b(i)})^2, \end{aligned} \end{aligned}$$

where b(i) is the index of the bucket into which \(g_i\) falls. Thus, we have

$$\begin{aligned} \begin{aligned} {\mathbb {E}}||{\tilde{g}}-g||^2&= {\mathbb {E}} \left[ \sum _{i=1}^{d} ({\tilde{g}}_i - g_i)^2\right] \leqslant \frac{d}{4q}\sum _{j=1}^{q} (v_{j+1} - v_{j})^2 \\&= \frac{d}{4q} \left( \sum _{j=1}^{b_{zero}-1} (v_{j+1} - v_{j})^2 + \sum _{j=b_{zero}}^{q} (v_{j+1} - v_{j})^2\right) \\&\leqslant \frac{d}{4q} \left( \left( \sum _{j=1}^{b_{zero}-1} (v_{j+1} {-} v_{j})\right) ^2 {+} \left( \sum _{j=b_{zero}}^{q} (v_{j+1} {-} v_{j})\right) ^2\right) \\&= \frac{d}{4q} (\phi _{min}^2+\phi _{max}^2). \end{aligned} \end{aligned}$$
(1)

\(\square \)
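As a sanity check on Theorem 2 (a hypothetical experiment of ours, not one reported in the paper), the short script below quantizes synthetic gradient values with exact quantile splits that include a split at zero, reconstructs them by bucket midpoints, and verifies that \(||{\tilde{g}}-g||^2\) stays below \(\frac{d}{4q}(\phi _{min}^2+\phi _{max}^2)\).

```python
import numpy as np

rng = np.random.default_rng(1)
d, q = 50_000, 256
g = rng.standard_normal(d) * 0.01                     # stand-in for nonzero gradient values

# q+1 splits with one split pinned at zero, as assumed in the analysis above
neg, pos = g[g < 0], g[g >= 0]
splits = np.unique(np.concatenate([
    np.quantile(neg, np.linspace(0, 1, q // 2 + 1)), [0.0],
    np.quantile(pos, np.linspace(0, 1, q // 2 + 1)),
]))

idx = np.clip(np.searchsorted(splits, g, side="right") - 1, 0, len(splits) - 2)
g_tilde = 0.5 * (splits[idx] + splits[idx + 1])       # midpoint reconstruction

error = np.sum((g_tilde - g) ** 2)
num_buckets = len(splits) - 1
bound = d / (4 * num_buckets) * (g.min() ** 2 + g.max() ** 2)
print(f"||g_tilde - g||^2 = {error:.3e}  <=  bound = {bound:.3e}: {error <= bound}")
```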

Corollary 1

When the distribution of gradients is not overly biased toward the extreme values, i.e., there exists \(\delta > 1\) such that \(\frac{||v||^2}{v_1^2+v_{q+1}^2} \geqslant \delta \), the bound in Eq. (1) is further bounded by \(\frac{1}{4(\delta -1)}||g||^2\).

Proof

Since \(\frac{||v||^2}{v_1^2+v_{q+1}^2} \geqslant \delta \), we have \(\phi _{min}^2+\phi _{max}^2 = v_1^2+v_{q+1}^2 \leqslant \frac{1}{\delta -1} \sum _{j=2}^{q}v_j^2\). Thus, we have

$$\begin{aligned} \frac{d}{4q} (\phi _{min}^2+\phi _{max}^2) \leqslant \frac{1}{4(\delta -1)} \sum _{j=2}^{q}\frac{d}{q}v_j^2 \leqslant \frac{1}{4(\delta -1)}||g||^2. \end{aligned}$$

For the most widely used uniform quantification method, Alistarh et al. proved that the variance is bounded by \(\min (\frac{d}{q^2}, \frac{\sqrt{d}}{q})||g||^2\) [1]. Therefore, quantile-bucket quantification yields a better bound as d goes to infinity. \(\square \)
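To make this comparison concrete, the snippet below (illustrative only; \(\delta \) is an assumed skew parameter, not a value from the paper) evaluates the two variance factors, \(\frac{1}{4(\delta -1)}\) for quantile-bucket quantification and \(\min (\frac{d}{q^2}, \frac{\sqrt{d}}{q})\) for uniform quantification [1], as d grows.

```python
import math

q, delta = 256, 5.0          # number of buckets; assumed skew parameter from Corollary 1
for d in (10**4, 10**6, 10**8):
    qb = 1.0 / (4 * (delta - 1))              # quantile-bucket factor (Corollary 1)
    qsgd = min(d / q**2, math.sqrt(d) / q)    # uniform-quantification factor [1]
    print(f"d={d:>9}:  quantile-bucket {qb:.4f} * ||g||^2   vs   uniform {qsgd:.4f} * ||g||^2")
```

The quantile-bucket factor stays constant while the uniform factor grows like \(\sqrt{d}/q\), which is the asymptotic advantage claimed above.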

1.2 MinMaxSketch

1.2.1 Error bound of the MinMaxSketch

Let \(\alpha \) represent the average number of counters in any given array of the MinMaxSketch that are incremented per insertion. Note that for the standard CM-sketch, the value of \(\alpha \) is equal to 1 because exactly one counter is incremented in each array when inserting an item. For the MinMaxSketch, \(\alpha \) is less than or equal to 1. For any given item e, let \(f_{(e)}\) represent its actual frequency and let \({\hat{f}}_{(e)}\) represent the estimate of its frequency returned by the MinMaxSketch. Let N represent the total number of insertions of all items into the MinMaxSketch. Let \(h_i(.)\) represent the hash function associated with the ith array of the MinMaxSketch, where \(1\leqslant i\leqslant d\). Let \(X_{i, (e)}[j]\) be the random variable that represents the difference between the actual frequency \(f_{(e)}\) of the item e and the value of the jth counter in the ith array, i.e., \(X_{i, (e)}[j] = A_i[j]- f_{(e)}\), where \(j=h_i(e)\). Due to hash collisions, multiple items may be mapped by the hash function \(h_i(.)\) to the counter j, which increases the value of \(A_i[j]\) beyond \(f_{(e)}\) and results in over-estimation error. As all hash functions have uniformly distributed output, \(Pr[h_i(e_1) = h_i(e_2)] = 1/w\). Therefore, the expected value of any counter \(A_i[j]\), where \(1\leqslant i\leqslant d\) and \(1\leqslant j\leqslant w\), is \(\alpha N/w\). Let \(\epsilon \) and \(\delta \) be two numbers that are related to d and w as follows: \(d = \lceil \ln (1/\delta )\rceil \) and \(w = \lceil \mathrm {e}/\epsilon \rceil \), where \(\mathrm {e}\) is Euler's number. The expected value of \(X_{i, (e)}[j]\) is given by the following expression:

$$\begin{aligned} E(X_{i, (e)}[j])=E(A_i[j]- f_{(e)}) \leqslant E(A_i[j]) =\frac{\alpha N}{w} \leqslant \frac{\epsilon \alpha }{\mathrm {e}}N. \end{aligned}$$

Finally, we derive the probabilistic bound on the over-estimation error of the MinMaxSketch.

$$\begin{aligned} \begin{aligned} Pr[{\hat{f}}_{(e)} \geqslant f_{(e)} + \epsilon \alpha N]&= Pr[\forall i, A_i[j] \geqslant f_{(e)} + \epsilon \alpha N]\\&= (Pr[A_i[j] - f_{(e)} \geqslant \epsilon \alpha N])^{d} \\&= (Pr[X_{i, (e)}[j]\geqslant \epsilon \alpha N]) ^ {d} \\&\leqslant (Pr[X_{i, (e)}[j]\geqslant \mathrm {e}\, E(X_{i, (e)}[j])])^{d}\\&\leqslant \mathrm {e}^ {-d} \leqslant \delta . \end{aligned} \end{aligned}$$
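The analysis above only uses the sizing \(d = \lceil \ln (1/\delta )\rceil \), \(w = \lceil \mathrm {e}/\epsilon \rceil \) and the fact that the MinMax strategy increments at most as many counters per insertion as the standard CM-sketch (\(\alpha \leqslant 1\)). The sketch below is a minimal illustration under one plausible reading of that strategy, namely incrementing only the counters that currently hold the row-wise minimum for the inserted item; the exact update and query rules of MinMaxSketch are defined in the main body of the paper and are not reproduced here.

```python
import math
import random

class MinMaxStyleSketch:
    """CM-style sketch sized as d = ceil(ln(1/delta)), w = ceil(e/epsilon)."""

    def __init__(self, epsilon=0.01, delta=0.01, seed=0):
        self.d = math.ceil(math.log(1.0 / delta))
        self.w = math.ceil(math.e / epsilon)
        self.tables = [[0] * self.w for _ in range(self.d)]
        rnd = random.Random(seed)
        self.seeds = [rnd.getrandbits(32) for _ in range(self.d)]

    def _cols(self, item):
        # Python's built-in hash; good enough for a demo, not pairwise independent.
        return [hash((s, item)) % self.w for s in self.seeds]

    def insert(self, item):
        cols = self._cols(item)
        current = [self.tables[i][c] for i, c in enumerate(cols)]
        m = min(current)
        # MinMax-style conservative update: only counters holding the row-wise
        # minimum grow, so at most one counter per row (alpha <= 1) is incremented.
        for i, c in enumerate(cols):
            if self.tables[i][c] == m:
                self.tables[i][c] += 1

    def query(self, item):
        # Point query: the minimum over the d rows, as in the standard CM-sketch.
        return min(self.tables[i][c] for i, c in enumerate(self._cols(item)))

if __name__ == "__main__":
    sk = MinMaxStyleSketch(epsilon=0.01, delta=0.01)
    for i in range(10_000):
        sk.insert(f"bucket-{i % 100}")
    print(sk.query("bucket-0"), "(true frequency: 100)")
```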

1.2.2 The correctness rate of the MinMaxSketch

Next, we theoretically derive the correctness rate of the MinMaxSketch, which is defined as the expected percentage of elements in the multiset for which the query response contains no error. In deriving the correctness rate, we make one assumption: all hash functions are pairwise independent. Before deriving the correctness rate, we first prove the following theorem.

Fig. 17 Assessment of compression size for QSGD

Theorem 3

In the MinMaxSketch, the value of any given counter is equal to the frequency of the least frequent element that maps to it.

Proof

We prove this theorem using mathematical induction on the number of insertions, represented by k.

Base case (k = 0): The theorem clearly holds for the base case because, before any insertions, the frequency of the least frequent element is 0, which is also the value of all counters.

Induction hypothesis (k = n): Suppose the statement of the theorem holds after n insertions.

Induction step (k = \(n+1\)): Let the \((n+1)\)st insertion be of an element e that has previously been inserted a times. Let \(\alpha _i(k)\) represent the value of the counter \(F_i[h_i(e)\%w]\) after k insertions, where \(0\leqslant i\leqslant d-1\). There are two cases to consider: (1) e was the least frequent element when \(k=n\); (2) e was not the least frequent element when \(k=n\).

Case 1 If e was the least frequent element when \(k=n\), then according to our induction hypotheses, \(\alpha _i(n)=a\). After inserting e, it will still be the least frequent element and its frequency increases to \(a+1\). As per our MinMaxSketch scheme, the counter \(F_i[h_i(e)\%w]\) will be incremented once. Consequently, we get \(\alpha _i(n+1)=a+1\). Thus for this case, the theorem statement holds because the value of the counter \(F_i[h_i(e)\%w]\) after insertion is still equal to the frequency of the least frequent element, which is e.

Case 2 If e was not the least frequent element when \(k=n\), then according to our induction hypotheses, \(\alpha _i(n)>a\). After inserting e, it may or may not become the least frequent element. If it becomes the least frequent element, it means that \(\alpha _i(n)=a+1\) and as per our MinMaxSketch scheme, the counter \(F_i[h_i(e)\%w]\) will stay unchanged. Consequently, we get \(\alpha _i(n+1)=\alpha _i(n)=a+1\). Thus for this case, the theorem statement again holds because the value of the counter \(F_i[h_i(e)\%w]\) after insertion is equal to the frequency of the new least frequent element, which is e.

After inserting e, if it does not become the least frequent element, then it means that \(\alpha _i(n)>a+1\) and, as per our MinMaxSketch scheme, the counter \(F_i[h_i(e)\%w]\) will stay unchanged. Consequently, \(\alpha _i(n+1)=\alpha _i(n)>a+1\). Thus, the theorem again holds because the value of the counter \(F_i[h_i(e)\%w]\) after the insertion is still equal to the frequency of the element that was the least frequent after n insertions. \(\square \)

Next, we derive the correctness rate of the MinMaxSketch. Let v be the number of distinct elements inserted into the MinMaxSketch, represented by \(e_1, e_2, \ldots , e_v\). Without loss of generality, let the element \(e_{l+1}\) be more frequent than \(e_l\), where \(1\leqslant l\leqslant v-1\). Let X be the random variable representing the number of elements hashing into the counter \(F_i[h_i(e_l)\%w]\) given the element \(e_l\), where \(0\leqslant i \leqslant d-1\) and \(1\leqslant l\leqslant v\). Clearly, \(X\sim \text {Binomial}(v-1, 1/w)\).

From Theorem 3, we conclude that if \(e_l\) has the highest frequency among all elements that map to the given counter \(F_i[h_i(e_l)\%w]\), then the query result for \(e_l\) will contain no error. Let A be the event that \(e_l\) has the maximum frequency among the x elements that map to \(F_i[h_i(e_l)\%w]\). The probability \(P\{A\}\) is given by the following equation:

$$\begin{aligned} P\{A\}=\left( {\begin{array}{c}l-1\\ x-1\end{array}}\right) \Big / \left( {\begin{array}{c}v-1\\ x-1\end{array}}\right) \quad (\text {where } x\leqslant l). \end{aligned}$$

Let \(P'\) represent the probability that the query result for \(e_l\) from any given counter contains no error. It is given by:

$$\begin{aligned} \begin{aligned} P'&=\sum _{x=1}^lP\{A\}\times P\{X=x\}\\&=\sum _{x=1}^l \dfrac{\left( {\begin{array}{c}l-1\\ x-1\end{array}}\right) }{\left( {\begin{array}{c}v-1\\ x-1\end{array}}\right) }\left( {\begin{array}{c}v-1\\ x-1\end{array}}\right) \Big (\frac{1}{w}\Big )^{x-1}\Big (1{-}\frac{1}{w}\Big )^{v-x} {=}\Big (1{-}\frac{1}{w}\Big )^{v-l}. \end{aligned} \end{aligned}$$

As there are d counters, the overall probability that the query result of \(e_l\) is correct is given by the following equation.

$$\begin{aligned} P_{\text {CR}}\{e_l\}=1-\bigg (1-\Big (1-\frac{1}{w}\Big )^{v-l}\bigg )^d. \end{aligned}$$

The equality above holds when all v elements have different frequencies. If two or more elements have equal frequencies, the correctness rate increases slightly. Consequently, the expected correctness rate Cr of the MinMaxSketch is bounded by:

$$\begin{aligned} Cr\geqslant \dfrac{\sum _{l=1}^v P_{\text {CR}}\{e_l\}}{v} =\dfrac{\sum _{l=1}^v \left( 1- \left( 1-(1-\frac{1}{w})^{v-l}\right) ^d \right) }{v}. \end{aligned}$$
(2)
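For concreteness, Eq. (2) can be evaluated numerically. The snippet below plugs in illustrative values of v, w, and d (chosen arbitrarily, not taken from the paper's experiments):

```python
def correctness_rate(v, w, d):
    """Lower bound on the expected correctness rate Cr from Eq. (2)."""
    total = 0.0
    for l in range(1, v + 1):
        p_correct = 1.0 - (1.0 - (1.0 - 1.0 / w) ** (v - l)) ** d
        total += p_correct
    return total / v

for w in (1_000, 10_000, 100_000):
    print(f"v=50_000, w={w:>7}, d=3  ->  Cr >= {correctness_rate(50_000, w, 3):.4f}")
```

As expected from Eq. (2), the correctness rate improves rapidly as the width w grows relative to the number of distinct elements v.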

1.3 Delta-binary encoding

Delta-binary encoding is a lossless compression method, but its average space cost cannot be calculated exactly. Here, we focus on the expected size of one key. As aforementioned, we divide all the quantile buckets into r groups. Therefore, the number of nonzero keys that fall into the same group is expected to be \(\frac{d}{r}\). Assuming the arrangement of dimensions in the dataset is random, the expected difference between two consecutive keys in the same group is \(\frac{rD}{d}\). As a result, the expected number of bytes for each key is \(\left\lceil \log _{256}{\frac{rD}{d}} \right\rceil = \left\lceil \frac{1}{8} \log _{2}{\frac{rD}{d}} \right\rceil \). For instance, with \(r=8\), we can compress each key into 1 byte if we choose a batch size large enough that \(\frac{d}{D}\geqslant \frac{1}{32}\).

Fortunately, the arrangement of dimensions in a dataset is usually not random, i.e., strongly related dimensions tend to appear as consecutive keys, which makes the difference between two nonzero keys even smaller. In practice, we find that the average size of one key (including two flag bits) is around 1.5 bytes.

The bitmap is another useful data structure for storing keys, with a compression rate of up to 8. Nonetheless, in our framework, the bitmap is not as useful as it may seem. In order to indicate the keys of the different groups, we would have to create one bitmap for each of them, which amounts to \(\left\lceil \frac{rD}{8} \right\rceil \) bytes in total. As a result, delta-binary encoding is the better choice.
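The following minimal sketch (ours, with an assumed bit layout) illustrates the delta-binary idea: sorted nonzero keys are turned into deltas, each delta is stored in the minimal number of bytes, and a 2-bit flag records that byte count. The flags are packed separately here for simplicity, whereas the paper embeds the adaptive prefix differently. The final print contrasts the measured bytes per key with the \(\left\lceil \frac{rD}{8} \right\rceil \)-byte cost of the bitmap alternative, for illustrative sizes that are not taken from the paper.

```python
import random

def delta_binary_encode(keys):
    """Encode sorted nonzero keys as (2-bit length flag, delta bytes) pairs."""
    flags, payload, prev = [], bytearray(), 0
    for k in keys:
        delta = k - prev
        prev = k
        nbytes = max(1, (delta.bit_length() + 7) // 8)   # minimal byte count for the delta
        assert nbytes <= 4, "delta does not fit in 4 bytes"
        flags.append(nbytes - 1)                         # 2-bit flag: 0..3
        payload += delta.to_bytes(nbytes, "little")
    flag_bytes = bytearray()
    for i in range(0, len(flags), 4):                    # pack four 2-bit flags per byte
        b = 0
        for j, f in enumerate(flags[i:i + 4]):
            b |= f << (2 * j)
        flag_bytes.append(b)
    return bytes(flag_bytes), bytes(payload)

def delta_binary_decode(flag_bytes, payload, n):
    keys, prev, pos = [], 0, 0
    for i in range(n):
        nbytes = ((flag_bytes[i // 4] >> (2 * (i % 4))) & 0b11) + 1
        prev += int.from_bytes(payload[pos:pos + nbytes], "little")
        pos += nbytes
        keys.append(prev)
    return keys

if __name__ == "__main__":
    D, d, r = 1_000_000, 100_000, 8                      # illustrative sizes only
    keys = sorted(random.Random(0).sample(range(D), d))
    flags, payload = delta_binary_encode(keys)
    assert delta_binary_decode(flags, payload, d) == keys
    total = len(flags) + len(payload)
    print(f"{total / d:.2f} bytes/key; delta-binary total {total} bytes"
          f" vs bitmap for r groups {r * D // 8} bytes")
```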

Tuning QSGD

To choose an appropriate compression size for QSGD, we conduct an experiment that compares different choices. Taking LR as a representative, we compare 4bits-QSGD, 8bits-QSGD, and Adam on the KDD12 and CTR datasets (2bits-QSGD is equivalent to TernGrad, so we do not consider it). As shown in Fig. 17a, b, 8bits-QSGD achieves almost the same convergence rate (loss in terms of epochs) as Adam. Intuitively, QSGD uses 256 quantization buckets with 8 bits, which provides enough precision for desirable convergence. When using 4 bits, however, the convergence rate is slower than that of 8bits-QSGD and Adam. This is reasonable because 4bits-QSGD has only 16 quantization buckets, which inevitably incurs a higher quantization error and harms convergence.

We then assess the run time of QSGD with different numbers of bits. Both QSGD variants run faster than Adam, as shown in Fig. 17c, owing to the reduction in communication cost. However, the run time does not decrease significantly as the number of bits decreases. This is not surprising for two reasons: (i) since QSGD only compresses the gradient values and stores the gradient keys as 4-byte integers, the communication overhead only decreases from 5d bytes to 4.5d bytes, where d is the number of nonzero values in the gradient; (ii) if the compression size is less than one byte, there is extra overhead of bit manipulation during encoding and decoding, whereas 8bits-QSGD can store each compressed value in a primitive byte.
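The arithmetic behind point (i) can be made explicit; the snippet below simply evaluates the per-gradient communication cost assuming 4-byte keys and d nonzero entries, as stated above (uncompressed 32-bit values are included for reference):

```python
def qsgd_bytes_per_gradient(d, value_bits):
    """Bytes sent per sparse gradient: 4-byte keys plus quantized values."""
    return 4 * d + d * value_bits / 8

d = 1_000_000                           # illustrative number of nonzero entries
for bits in (32, 8, 4):
    print(f"{bits:>2}-bit values: {qsgd_bytes_per_gradient(d, bits) / d:.1f}d bytes per gradient")
```

This reproduces the 5d-byte and 4.5d-byte figures for 8-bit and 4-bit values, respectively.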

Based on these experimental results, we choose a compression size of 8 bits for QSGD in our end-to-end comparison in Sect. 7.

Effectiveness of adaptive learning rate

As introduced in Sect. 4.2, we address the problem of vanishing gradients with an adaptive learning rate method. To choose an appropriate technique, we compare Adam and AMSGrad by training LR on the KDD10 dataset, which is described in Table 1. As shown in Fig. 18, the convergence rates of Adam and AMSGrad are almost the same without compression. AMSGrad eventually achieves a lower testing loss, but the gap is small, which is consistent with the results on LR in [44]. After applying SKCompress, their convergence rates still match. Although convergence with SKCompress is slower than its uncompressed counterpart in the first few epochs, owing to the under-estimation property of MinMaxSketch, it eventually reaches a comparable testing loss with the help of the adaptive learning rate.

Since Adam and AMSGrad have similar convergence behavior, we choose Adam as our optimizer: it achieves state-of-the-art performance and is one of the most widely used adaptive methods.
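For reference, the sketch below shows a minimal Adam update step and the AMSGrad modification (the maximum on the second-moment estimate), following [27, 44]; the hyperparameters are the common defaults, not necessarily those used in our experiments, and details such as bias correction vary slightly across implementations.

```python
import numpy as np

def adam_step(theta, grad, m, v, vhat_max, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, amsgrad=False):
    """One Adam/AMSGrad update [27, 44]; returns new parameters and optimizer state."""
    m = beta1 * m + (1 - beta1) * grad                    # first-moment estimate (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2               # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                          # bias correction
    v_hat = v / (1 - beta2 ** t)
    if amsgrad:                                           # AMSGrad: never let v_hat decrease
        vhat_max = np.maximum(vhat_max, v_hat)
        v_hat = vhat_max
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # adaptive per-coordinate step
    return theta, m, v, vhat_max
```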

Fig. 18 Comparison of Adam and AMSGrad. The evaluated metric is the testing loss over epochs. Convergence without SKCompress is plotted with solid lines and convergence with SKCompress with dashed lines

About this article

Cite this article

Jiang, J., Fu, F., Yang, T. et al. SKCompress: compressing sparse and nonuniform gradient in distributed machine learning. The VLDB Journal 29, 945–972 (2020). https://doi.org/10.1007/s00778-019-00596-3
