Abstract
Wide-area data transfer is central to geographically distributed scientific workflows. Faster delivery of data is important for these workflows, and predictability is equally (or even more) important. With the goal of providing a reasonably accurate estimate of data transfer time, both to improve resource allocation and scheduling for workflows and to enable end-to-end data transfer optimization, we apply machine learning methods to develop predictive models for data transfer times over a variety of wide-area networks. To build and evaluate these models, we use 201,388 transfers, involving 759 million files totaling 9 PB transferred, over 115 heavily used source-destination pairs (“edges”) between 135 unique endpoints. We evaluate the models for different retraining frequencies and different window sizes of historical data. In the best case, the resulting models have a median prediction error of \(\le \)21% for 50% of the edges, and \(\le \)32% for 75% of the edges. We present a detailed analysis of these results that provides insights into the cause of some of the high errors. We envision that the performance predictor will be informative for scheduling geo-distributed workflows. The insights also suggest obvious directions for both further analysis and transfer service optimization.
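To make the headline figures concrete, the following is a minimal sketch of how a per-edge median prediction error, and the share of edges falling under an error threshold, could be computed. It assumes that "prediction error" means absolute percentage error of predicted versus actual transfer time; the abstract does not state the exact definition. The function names, the record layout, and the sample values are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import median


def median_error_per_edge(records):
    """Return {edge: median absolute percentage error}.

    `records` is an iterable of (edge, actual_seconds, predicted_seconds).
    Assumption: error is |predicted - actual| / actual, in percent.
    """
    errors = defaultdict(list)
    for edge, actual, predicted in records:
        if actual <= 0:
            continue  # skip records with no usable ground truth
        errors[edge].append(abs(predicted - actual) * 100.0 / actual)
    return {edge: median(errs) for edge, errs in errors.items()}


def fraction_of_edges_within(per_edge_median, threshold_pct):
    """Fraction of edges whose median error is at most `threshold_pct`."""
    if not per_edge_median:
        return 0.0
    hits = sum(1 for m in per_edge_median.values() if m <= threshold_pct)
    return hits / len(per_edge_median)


# Made-up illustrative records, not data from the study.
records = [
    ("A->B", 100.0, 110.0),
    ("A->B", 200.0, 160.0),
    ("A->B", 50.0, 65.0),
    ("C->D", 80.0, 120.0),
    ("C->D", 40.0, 60.0),
]

per_edge = median_error_per_edge(records)
print(per_edge)
print(fraction_of_edges_within(per_edge, 21.0))
```

Read this way, a statement such as "median prediction error of at most 21% for 50% of the edges" corresponds to `fraction_of_edges_within(per_edge, 21.0)` being at least 0.5.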
Acknowledgments
This material is based upon work supported by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357. We gratefully acknowledge the computing resources provided and operated by the Joint Laboratory for System Evaluation (JLSE) at Argonne National Laboratory.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, Z., Kettimuthu, R., Balaprakash, P., Rao, N.S.V., Foster, I. (2019). Building a Wide-Area File Transfer Performance Predictor: An Empirical Study. In: Renault, É., Mühlethaler, P., Boumerdassi, S. (eds) Machine Learning for Networking. MLN 2018. Lecture Notes in Computer Science, vol 11407. Springer, Cham. https://doi.org/10.1007/978-3-030-19945-6_5
DOI: https://doi.org/10.1007/978-3-030-19945-6_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-19944-9
Online ISBN: 978-3-030-19945-6
eBook Packages: Computer Science (R0)