
\(FC^{2}\): cloud-based cluster provisioning for distributed machine learning

Abstract

Training large, complex machine learning models such as deep neural networks with big data requires powerful computing clusters, which are costly to acquire, use and maintain. As a result, many machine learning researchers turn to cloud computing services for their on-demand and elastic resource provisioning capabilities. Two issues have arisen from this trend: (1) if not configured properly, training models on cloud-based clusters can incur significant cost and training time, and (2) many researchers in machine learning tend to focus more on model and algorithm development, so they may not have the time or skills to deal with system setup, resource selection and configuration. In this work, we propose and implement \(FC^{2}\): a system for fast, convenient and cost-effective distributed machine learning over public cloud resources. Central to the effectiveness of \(FC^{2}\) is the ability to recommend an appropriate resource configuration, in terms of cost and execution time, for a given model training task. Our approach differs from previous work in that it does not require manually analyzing the code and dataset of the training task in advance. The recommended resource configuration can then be deployed and managed automatically by \(FC^{2}\) until the training task is completed. We have conducted extensive experiments with an implementation of \(FC^{2}\), using real-world deep neural network models and datasets. The results demonstrate the effectiveness of our approach, which can produce cost savings of up to 80% while maintaining training performance similar to that of much more expensive resource configurations.
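
As a rough, illustrative sketch of the kind of provisioning step \(FC^{2}\) automates (the implementation drives AWS through boto3; see the notes below), the snippet that follows launches a small EC2 cluster of one parameter server plus several workers. The AMI ID, instance type, key name and cluster size are hypothetical placeholders, not configurations recommended by the paper.

    # Illustrative sketch only, not code from the paper: start a small EC2
    # cluster (1 parameter server + num_workers workers) with boto3.
    import boto3

    ec2 = boto3.resource("ec2", region_name="us-east-1")  # region is a placeholder

    def launch_cluster(ami_id, instance_type, num_workers, key_name):
        """Launch num_workers + 1 identical instances and wait until they run."""
        instances = ec2.create_instances(
            ImageId=ami_id,              # e.g. a deep learning AMI (placeholder)
            InstanceType=instance_type,  # the configuration to deploy
            MinCount=num_workers + 1,    # +1 for the parameter server
            MaxCount=num_workers + 1,
            KeyName=key_name,            # SSH key for later setup (e.g. via paramiko)
        )
        for instance in instances:
            instance.wait_until_running()  # block until EC2 reports "running"
            instance.reload()              # refresh to obtain the public IP address
        return instances

    # Usage with placeholder values:
    # cluster = launch_cluster("ami-0123456789abcdef0", "p2.xlarge", 4, "my-key")

A full deployment would additionally install the training framework on each instance, make the dataset available (for example via a shared file system), and start the parameter server and worker processes; per the abstract, this deployment and management is what \(FC^{2}\) handles automatically.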


Notes

  1. Model parallelism is another approach to speeding up training; it is beyond the scope of this paper.

  2. \(n_{t}\) should be initialized to a very large value.

  3. https://github.com/boto/boto3.

  4. http://www.paramiko.org.

  5. https://pymotw.com/2/subprocess.

  6. https://github.com/raboof/nethogs (see the illustrative sketch after these notes).

  7. \(FC^2\) provides a number of the most popular training datasets via AWS Elastic File System.

  8. https://code.google.com/archive/p/cuda-convnet.

  9. The EC2 documentation only states that the network performance of these instance types is classified as "High". More information is available at https://aws.amazon.com/ec2/instance-types/.

  10. Similar results have also been obtained for larger cluster sizes.

  11. The parameter \(p\) in the Scala-Opt algorithms is set to 0.95 to avoid bandwidth saturation at the parameter server, which has a capacity of around 450 Mbps.
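
Notes 5 and 6 above point to subprocess and nethogs, which suggests per-process network monitoring driven from Python. As a hedged, illustrative sketch (not the paper's code), the snippet below shows one way such monitoring could be wired up; the trace-mode ("-t") output format assumed here (tab-separated program, sent and received rates in KB/s) may differ across nethogs versions.

    # Illustrative sketch only: stream per-process network usage by driving
    # nethogs in trace mode ("-t") through subprocess. Parsing assumes
    # tab-separated "program<TAB>sent<TAB>received" lines in KB/s; adjust if
    # your nethogs version prints a different format. Usually requires root.
    import subprocess

    def stream_process_bandwidth(interface="eth0"):
        """Yield (program, sent, received) tuples reported by nethogs."""
        proc = subprocess.Popen(
            ["nethogs", "-t", interface],
            stdout=subprocess.PIPE,
            text=True,
        )
        for line in proc.stdout:
            parts = line.strip().split("\t")
            if len(parts) != 3:
                continue  # skip "Refreshing:" markers and malformed lines
            program, sent, received = parts
            try:
                yield program, float(sent), float(received)
            except ValueError:
                continue  # skip header or non-numeric lines

    # Example: watch the traffic of training processes (placeholder filter).
    # for prog, sent, recv in stream_process_bandwidth("eth0"):
    #     if "python" in prog:
    #         print(prog, sent, recv)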


Acknowledgements

This research has been supported by the Academic Research Fund (AcRF) Tier 1 Grant RG121/15.

Author information


Corresponding author

Correspondence to Ta Nguyen Binh Duong.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Ta, N.B.D. \(FC^{2}\): cloud-based cluster provisioning for distributed machine learning. Cluster Comput 22, 1299–1315 (2019). https://doi.org/10.1007/s10586-019-02912-6

