# SKCompress: compressing sparse and nonuniform gradient in distributed machine learning


## Abstract

Distributed machine learning (ML) has been extensively studied to cope with the explosive growth of training data. A wide range of ML models are trained with a family of first-order optimization algorithms, i.e., stochastic gradient descent (SGD). The core operation of SGD is the calculation of gradients. When executing SGD in a distributed environment, the workers need to exchange their local gradients over the network. To reduce this communication cost, a class of quantization-based compression algorithms transforms the gradients into a binary format at the expense of a small loss of precision. Although the existing approaches work well for dense gradients, we find that they are ill-suited for the many cases where the gradients are sparse and nonuniformly distributed. In this paper, we ask: *is there a compression framework that can efficiently handle sparse and nonuniform gradients?* We propose a general compression framework, called SKCompress, that compresses both the values and the keys of sparse gradients. Our first contribution is a sketch-based method that compresses the gradient values. A sketch is a class of algorithms that approximate the distribution of a data stream with a probabilistic data structure. We first use a quantile sketch to generate splits, sort the gradient values into buckets, and encode them with their bucket indexes. Our second contribution is a new sketch algorithm, namely MinMaxSketch, which compresses the bucket indexes. MinMaxSketch builds a set of hash tables and resolves hash collisions with a MinMax strategy. Since the bucket indexes are nonuniformly distributed, we further apply Huffman coding to compress the output of MinMaxSketch. To compress the keys of sparse gradients, the third contribution of this paper is a delta-binary encoding method that computes the increments of the gradient keys and encodes them in binary format. An adaptive prefix assigns different sizes to different gradient keys, saving further space. We also analyze the correctness and the error bounds of the proposed methods. To the best of our knowledge, this is the first effort to use data sketches to compress gradients in ML. We implement a prototype system in a real cluster of our industrial partner Tencent Inc. and show that our method is up to \(12\times \) faster than the existing methods.
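The value-compression step described above (quantile splits, bucketization, and index-based encoding) can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's implementation: it computes exact quantile splits by sorting, whereas SKCompress approximates them with a streaming quantile sketch, and the function names are ours.

```python
# Sketch of quantile-based gradient-value quantization: derive split points,
# replace each value by its bucket index, and decode an index back to a
# representative value. Exact quantiles stand in for the quantile sketch.
import bisect

def quantile_splits(values, num_buckets):
    """Return num_buckets - 1 split points from the empirical quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // num_buckets] for i in range(1, num_buckets)]

def encode(values, splits):
    """Map each value to the index of the bucket it falls into."""
    return [bisect.bisect_right(splits, v) for v in values]

def decode(indexes, splits):
    """Recover an approximate value per index: the bucket midpoint, with the
    two outer buckets clamped to the nearest split point."""
    reps = []
    for i in indexes:
        lo = splits[i - 1] if i > 0 else splits[0]
        hi = splits[i] if i < len(splits) else splits[-1]
        reps.append((lo + hi) / 2.0)
    return reps

grads = [0.01, -0.5, 0.02, 0.8, -0.03, 0.015, 0.4, -0.2]
splits = quantile_splits(grads, 4)          # [-0.03, 0.015, 0.4]
codes = encode(grads, splits)               # small ints instead of floats
approx = decode(codes, splits)              # lossy reconstruction
```

Only the bucket indexes (here, values in 0..3) need to be shipped over the network; the paper further compresses these nonuniform indexes with MinMaxSketch and Huffman coding.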
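The key-compression idea (delta encoding of sorted keys, with shorter encodings for smaller increments) can be illustrated with a varint-style byte layout. This is an assumption-laden analogue, not the paper's exact adaptive-prefix format: each byte holds 7 payload bits plus a continuation flag, and the function names are ours.

```python
# Sketch of delta-binary key compression: sorted sparse-gradient keys are
# replaced by their increments, which are small and fit in fewer bytes.
# A varint-style encoding stands in for the paper's adaptive prefix.

def delta_varint_encode(keys):
    """keys: strictly increasing non-negative ints -> compact bytes."""
    out = bytearray()
    prev = 0
    for k in keys:
        delta = k - prev
        prev = k
        while True:
            byte = delta & 0x7F
            delta >>= 7
            if delta:
                out.append(byte | 0x80)  # high bit set: more bytes follow
            else:
                out.append(byte)
                break
    return bytes(out)

def delta_varint_decode(data):
    """Invert delta_varint_encode: rebuild the original key list."""
    keys, cur, shift, acc = [], 0, 0, 0
    for byte in data:
        acc |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            cur += acc              # undo the delta step
            keys.append(cur)
            acc, shift = 0, 0
    return keys

keys = [3, 10, 1000, 1000000]
blob = delta_varint_encode(keys)    # 7 bytes vs. 16 with fixed 4-byte keys
restored = delta_varint_decode(blob)
```

Because the increments between neighboring nonzero keys are typically far smaller than the keys themselves, most deltas fit in one or two bytes, which is where the space saving comes from.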

## Keywords

Distributed machine learning · Stochastic gradient descent · Quantization · Quantile sketch · Frequency sketch · Huffman coding

## Notes

### Acknowledgements

This work is supported by NSFC (No. 61832001, 61702016, 61702015, 61572039, U1936104), the National Key Research and Development Program of China (No. 2018YFB1004403), the Beijing Academy of Artificial Intelligence (BAAI), the PKU-Tencent Joint Research Lab, and the PCL Future project "Regional Network Facilities for Large-scale Experiments and Applications" under Grant PCL2018KP001.
