\(FC^{2}\): cloud-based cluster provisioning for distributed machine learning

  • Ta Nguyen Binh Duong


Training large, complex machine learning models such as deep neural networks on big data requires powerful computing clusters, which are costly to acquire, use and maintain. As a result, many machine learning researchers turn to cloud computing services for their on-demand and elastic resource provisioning capabilities. Two issues have arisen from this trend: (1) if not configured properly, training models on cloud-based clusters can incur significant cost and time, and (2) many machine learning researchers focus on model and algorithm development, so they may not have the time or skills to deal with system setup, resource selection and configuration. In this work, we propose and implement \(FC^{2}\): a system for fast, convenient and cost-effective distributed machine learning over public cloud resources. Central to the effectiveness of \(FC^{2}\) is its ability to recommend an appropriate resource configuration, in terms of cost and execution time, for a given model training task. Our approach differs from previous work in that it does not require manual analysis of the code and dataset of the training task in advance. The recommended resource configuration can then be deployed and managed automatically by \(FC^{2}\) until the training task is completed. We have conducted extensive experiments with an implementation of \(FC^{2}\), using real-world deep neural network models and datasets. The results demonstrate the effectiveness of our approach, which can produce cost savings of up to 80% while maintaining training performance similar to that of much more expensive resource configurations.


Keywords: Distributed machine learning; Cloud-based clusters; Resource recommendation; Cluster deployment



The research has been supported via the Academic Research Fund (AcRF) Tier 1 Grant RG121/15.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
