
\(FC^{2}\): cloud-based cluster provisioning for distributed machine learning

Abstract

Training large, complex machine learning models such as deep neural networks with big data requires powerful computing clusters, which are costly to acquire, use and maintain. As a result, many machine learning researchers turn to cloud computing services for their on-demand and elastic resource provisioning capabilities. Two issues have arisen from this trend: (1) if not configured properly, training models on cloud-based clusters can incur significant cost and training time, and (2) many researchers in machine learning tend to focus more on model and algorithm development, so they may not have the time or skills to deal with system setup, resource selection and configuration. In this work, we propose and implement \(FC^{2}\): a system for fast, convenient and cost-effective distributed machine learning over public cloud resources. Central to the effectiveness of \(FC^{2}\) is the ability to recommend an appropriate resource configuration, in terms of cost and execution time, for a given model training task. Our approach differs from previous work in that it does not require manually analyzing the code and dataset of the training task in advance. The recommended resource configuration can then be deployed and managed automatically by \(FC^{2}\) until the training task is completed. We have conducted extensive experiments with an implementation of \(FC^{2}\), using real-world deep neural network models and datasets. The results demonstrate the effectiveness of our approach, which can produce cost savings of up to 80% while maintaining training performance similar to that of much more expensive resource configurations.
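
As a rough, illustrative sketch of the kind of provisioning step \(FC^{2}\) automates (the implementation drives AWS through boto3; see the notes below), the snippet that follows launches a small EC2 cluster of one parameter server plus several workers. The AMI ID, instance type, key name and cluster size are hypothetical placeholders, not configurations recommended by the paper.

    # Illustrative sketch only, not code from the paper: start a small EC2
    # cluster (1 parameter server + num_workers workers) with boto3.
    import boto3

    ec2 = boto3.resource("ec2", region_name="us-east-1")  # region is a placeholder

    def launch_cluster(ami_id, instance_type, num_workers, key_name):
        """Launch num_workers + 1 identical instances and wait until they run."""
        instances = ec2.create_instances(
            ImageId=ami_id,              # e.g. a deep learning AMI (placeholder)
            InstanceType=instance_type,  # the configuration to deploy
            MinCount=num_workers + 1,    # +1 for the parameter server
            MaxCount=num_workers + 1,
            KeyName=key_name,            # SSH key for later setup (e.g. via paramiko)
        )
        for instance in instances:
            instance.wait_until_running()  # block until EC2 reports "running"
            instance.reload()              # refresh to obtain the public IP address
        return instances

    # Usage with placeholder values:
    # cluster = launch_cluster("ami-0123456789abcdef0", "p2.xlarge", 4, "my-key")

A full deployment would additionally install the training framework on each instance, make the dataset available (for example via a shared file system), and start the parameter server and worker processes; per the abstract, this deployment and management is what \(FC^{2}\) handles automatically.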


Notes

  1. Model parallelism is another approach to speeding up training; it is beyond the scope of this paper.

  2. \(n_{t}\) should be initialized to a very large value.

  3. https://github.com/boto/boto3.

  4. http://www.paramiko.org.

  5. https://pymotw.com/2/subprocess.

  6. https://github.com/raboof/nethogs (see the illustrative sketch after these notes).

  7. \(FC^2\) provides a number of the most popular training datasets via AWS Elastic File System.

  8. https://code.google.com/archive/p/cuda-convnet.

  9. The EC2 documentation only states that the network performance of these instance types is classified as "High". More information is available at https://aws.amazon.com/ec2/instance-types/.

  10. Similar results have also been obtained for larger cluster sizes.

  11. The parameter \(p\) in the Scala-Opt algorithms is set to 0.95 to avoid bandwidth saturation at the parameter server, which has a capacity of around 450 Mbps.
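
Notes 5 and 6 above point to subprocess and nethogs, which suggests per-process network monitoring driven from Python. As a hedged, illustrative sketch (not the paper's code), the snippet below shows one way such monitoring could be wired up; the trace-mode ("-t") output format assumed here (tab-separated program, sent and received rates in KB/s) may differ across nethogs versions.

    # Illustrative sketch only: stream per-process network usage by driving
    # nethogs in trace mode ("-t") through subprocess. Parsing assumes
    # tab-separated "program<TAB>sent<TAB>received" lines in KB/s; adjust if
    # your nethogs version prints a different format. Usually requires root.
    import subprocess

    def stream_process_bandwidth(interface="eth0"):
        """Yield (program, sent, received) tuples reported by nethogs."""
        proc = subprocess.Popen(
            ["nethogs", "-t", interface],
            stdout=subprocess.PIPE,
            text=True,
        )
        for line in proc.stdout:
            parts = line.strip().split("\t")
            if len(parts) != 3:
                continue  # skip "Refreshing:" markers and malformed lines
            program, sent, received = parts
            try:
                yield program, float(sent), float(received)
            except ValueError:
                continue  # skip header or non-numeric lines

    # Example: watch the traffic of training processes (placeholder filter).
    # for prog, sent, recv in stream_process_bandwidth("eth0"):
    #     if "python" in prog:
    #         print(prog, sent, recv)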


Acknowledgements

This research has been supported by the Academic Research Fund (AcRF) Tier 1 Grant RG121/15.

Author information


Corresponding author

Correspondence to Ta Nguyen Binh Duong.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Ta, N.B.D. \(FC^{2}\): cloud-based cluster provisioning for distributed machine learning. Cluster Comput 22, 1299–1315 (2019). https://doi.org/10.1007/s10586-019-02912-6

