SKCompress: compressing sparse and nonuniform gradient in distributed machine learning

  • Jiawei Jiang
  • Fangcheng Fu
  • Tong Yang
  • Yingxia Shao
  • Bin Cui
Regular Paper


Distributed machine learning (ML) has been extensively studied to meet the explosive growth of training data. A wide range of machine learning models are trained by a family of first-order optimization algorithms, namely stochastic gradient descent (SGD). The core operation of SGD is the calculation of gradients. When executing SGD in a distributed environment, the workers need to exchange local gradients through the network. To reduce the communication cost, a category of quantification-based compression algorithms is used to transform the gradients into a binary format, at the expense of a small loss of precision. Although the existing approaches work well for dense gradients, we find that they are ill-suited for many cases where the gradients are sparse and nonuniformly distributed. In this paper, we ask: is there a compression framework that can efficiently handle sparse and nonuniform gradients? We propose a general compression framework, called SKCompress, to compress both the gradient values and the gradient keys in sparse gradients. Our first contribution is a sketch-based method that compresses the gradient values. A sketch is a class of algorithms that approximates the distribution of a data stream with a probabilistic data structure. We first use a quantile sketch to generate splits, sort gradient values into buckets, and encode them with the bucket indexes. Our second contribution is a new sketch algorithm, namely MinMaxSketch, which compresses the bucket indexes. MinMaxSketch builds a set of hash tables and solves hash collisions with a MinMax strategy. Since the bucket indexes are nonuniform, we further adopt Huffman coding to compress MinMaxSketch. To compress the keys of sparse gradients, the third contribution of this paper is a delta-binary encoding method that calculates the increments of the gradient keys and encodes them in binary format. An adaptive prefix is proposed to assign different sizes to different gradient keys, so that we can save more space. We also theoretically discuss the correctness and the error bound of our proposed methods. To the best of our knowledge, this is the first effort utilizing data sketches to compress gradients in ML. We implement a prototype system in a real cluster of our industrial partner Tencent Inc. and show that our method is up to \(12\times\) faster than the existing methods.
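The value-bucketing and key-encoding steps described above can be illustrated with a minimal Python sketch. It is not the SKCompress implementation. Exact quantiles computed by sorting stand in for a quantile sketch, and the adaptive prefix is assumed to be a 2-bit field giving the byte length (1 to 4) of each key increment; the abstract does not specify the actual layout. All function names are hypothetical.

```python
import bisect
from typing import List


def quantile_splits(values: List[float], num_buckets: int) -> List[float]:
    """Return num_buckets - 1 split points at equal-rank positions.

    Assumption: exact quantiles via sorting replace the quantile sketch.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // num_buckets] for i in range(1, num_buckets)]


def bucketize(values: List[float], splits: List[float]) -> List[int]:
    """Encode each gradient value as the index of the bucket it falls into."""
    return [bisect.bisect_right(splits, v) for v in values]


def encode_keys(keys: List[int]) -> str:
    """Delta-encode sorted, non-negative keys into a string of '0'/'1' bits.

    Assumed layout: a 2-bit prefix holding (byte length - 1), followed by
    the increment itself in that many bytes.
    """
    bits = []
    prev = 0
    for key in keys:
        delta = key - prev
        if delta < 0:
            raise ValueError("keys must be sorted in ascending order")
        prev = key
        nbytes = max(1, (delta.bit_length() + 7) // 8)
        if nbytes > 4:
            raise ValueError("increment does not fit in 4 bytes")
        bits.append(format(nbytes - 1, "02b"))
        bits.append(format(delta, f"0{8 * nbytes}b"))
    return "".join(bits)


def decode_keys(bitstr: str) -> List[int]:
    """Invert encode_keys by reading each prefix and accumulating increments."""
    keys = []
    pos = 0
    prev = 0
    while pos < len(bitstr):
        nbytes = int(bitstr[pos:pos + 2], 2) + 1
        pos += 2
        prev += int(bitstr[pos:pos + 8 * nbytes], 2)
        pos += 8 * nbytes
        keys.append(prev)
    return keys


if __name__ == "__main__":
    values = [-0.9, -0.1, 0.0, 0.05, 0.2, 0.8, 0.01, -0.02]
    splits = quantile_splits(values, num_buckets=4)
    print(bucketize(values, splits))

    keys = [3, 10, 300, 70000]
    encoded = encode_keys(keys)
    print(len(encoded), decode_keys(encoded))
```

In this toy layout, small increments between neighbouring keys take one byte plus the prefix, so the four example keys occupy 64 bits rather than the 128 bits of four 32-bit integers. The decoder recovers the keys exactly, whereas the bucket indexes are a lossy encoding of the values.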


Distributed machine learning · Stochastic gradient descent · Quantification · Quantile sketch · Frequency sketch · Huffman coding



This work is supported by NSFC (No. 61832001, 61702016, 61702015, 61572039, U1936104), the National Key Research and Development Program of China (No. 2018YFB1004403), Beijing Academy of Artificial Intelligence (BAAI), PKU-Tencent joint research Lab, and the project PCL Future “Regional Network Facilities for Large-scale Experiments and Applications” under Grant PCL2018KP001.




Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Authors and Affiliations

  1. School of EECS & Key Laboratory of High Confidence Software Technologies (MOE), Peking University, Beijing, China
  2. Department of Computer Science, ETH Zürich, Zürich, Switzerland
  3. Department of Data Platform, TEG, Tencent Inc., Beijing, China
  4. Department of Computer Science and Technology, Peking University, Beijing, China
  5. Peng Cheng Laboratory, Shenzhen, China
  6. School of Computer Science & Beijing Key Lab of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing, China
  7. Department of Computer Science and Technology & Key Laboratory of High Confidence Software Technologies (MOE), Peking University, Beijing, China
