Abstract
Irregular applications are challenging to scale on supercomputers due to the difficulty of balancing load across large numbers of nodes. This challenge is exacerbated by the increasing heterogeneity of modern supercomputers, in which nodes often contain multiple processors and coprocessors operating at different speeds, and with differing core and thread counts. We present Dtree, a dynamic task scheduler designed to address this challenge. Dtree shows close to optimal results for a class of HPC applications, improving time-to-solution by achieving near-perfect load balance while consuming negligible resources. We demonstrate Dtree’s effectiveness on up to 77,824 heterogeneous cores of the TACC Stampede supercomputer with two different petascale HPC applications: ParaBLe, which performs large-scale Bayesian network structure learning, and GTFock, which implements Fock matrix construction, an essential and expensive step in quantum chemistry codes. For ParaBLe, we show improved performance while eliminating the complexity of managing heterogeneity. For GTFock, we match the most recently published performance without using any application-specific optimizations for data access patterns (such as the task distribution design for communication reduction) that enabled that performance. We also show that Dtree can distribute from tens of thousands to hundreds of millions of irregular tasks across up to 1024 nodes with minimal overhead, while balancing load to within 2% of optimal.
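The core idea behind dynamic task scheduling on heterogeneous nodes can be illustrated with a small simulation. The sketch below is not Dtree's actual distributed protocol; it is a hypothetical single-process model in which workers of differing speeds self-schedule irregular tasks from a shared pool (the worker that becomes free earliest takes the next task). The worker speeds and task costs are invented for illustration. With many small tasks relative to the number of workers, the resulting makespan stays close to the perfect-balance lower bound, which is the effect the abstract's "within 2% of optimal" claim describes.

```python
import heapq

def dynamic_schedule(task_costs, worker_speeds):
    """Greedy self-scheduling: each task is taken by the worker that
    becomes free earliest. Returns the makespan (last finish time)."""
    # Min-heap of (finish_time, worker_id); all workers start idle at t=0.
    heap = [(0.0, w) for w in range(len(worker_speeds))]
    heapq.heapify(heap)
    for cost in task_costs:
        t, w = heapq.heappop(heap)
        # A faster worker finishes the same task sooner.
        heapq.heappush(heap, (t + cost / worker_speeds[w], w))
    return max(t for t, _ in heap)

# Hypothetical heterogeneous node: two fast processors, two slower coprocessors.
speeds = [1.0, 1.0, 0.6, 0.6]
# Irregular task costs (illustrative only).
tasks = [1.0 + (i % 7) * 0.3 for i in range(10_000)]

makespan = dynamic_schedule(tasks, speeds)
# Perfect-balance lower bound: total work divided by aggregate speed.
lower_bound = sum(tasks) / sum(speeds)
imbalance = makespan / lower_bound - 1.0
print(f"load imbalance: {imbalance:.3%}")
```

Because every task costs far less than the total runtime, the greedy rule keeps all workers busy until near the end, so the imbalance is a small fraction of the makespan. A static partition computed from inaccurate speed estimates would not adapt this way, which is why dynamic scheduling suits heterogeneous nodes.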
X. Liu—During this research, Xing Liu was affiliated with Georgia Institute of Technology.
Notes
1. Intel, Xeon, and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries.
2. We refer to Dtree tasks as work items in this application.
3. We refer to Dtree tasks as work items, to prevent confusion with GTFock tasks.
4. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.
Acknowledgements
The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this paper. URL: http://www.tacc.utexas.edu.
Copyright information
© 2015 Springer International Publishing Switzerland
Cite this paper
Pamnany, K., Misra, S., Md., V., Liu, X., Chow, E., Aluru, S. (2015). Dtree: Dynamic Task Scheduling at Petascale. In: Kunkel, J., Ludwig, T. (eds) High Performance Computing. ISC High Performance 2015. Lecture Notes in Computer Science(), vol 9137. Springer, Cham. https://doi.org/10.1007/978-3-319-20119-1_10
Print ISBN: 978-3-319-20118-4
Online ISBN: 978-3-319-20119-1