Abstract
Irregular applications are challenging to scale on supercomputers due to the difficulty of balancing load across large numbers of nodes. This challenge is exacerbated by the increasing heterogeneity of modern supercomputers, in which nodes often contain multiple processors and coprocessors operating at different speeds, and with differing core and thread counts. We present Dtree, a dynamic task scheduler designed to address this challenge. Dtree shows close to optimal results for a class of HPC applications, improving time-to-solution by achieving near-perfect load balance while consuming negligible resources. We demonstrate Dtree’s effectiveness on up to 77,824 heterogeneous cores of the TACC Stampede supercomputer with two different petascale HPC applications: ParaBLe, which performs large-scale Bayesian network structure learning, and GTFock, which implements Fock matrix construction, an essential and expensive step in quantum chemistry codes. For ParaBLe, we show improved performance while eliminating the complexity of managing heterogeneity. For GTFock, we match the most recently published performance without using any application-specific optimizations for data access patterns (such as the task distribution design for communication reduction) that enabled that performance. We also show that Dtree can distribute from tens of thousands to hundreds of millions of irregular tasks across up to 1024 nodes with minimal overhead, while balancing load to within 2% of optimal.
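The core idea behind dynamic task scheduling on heterogeneous nodes can be illustrated with a small simulation. The sketch below is not Dtree's actual distributed protocol; it is a hypothetical single-process model in which workers of differing speeds self-schedule irregular tasks from a shared pool (the worker that becomes free earliest takes the next task). The worker speeds and task costs are invented for illustration. With many small tasks relative to the number of workers, the resulting makespan stays close to the perfect-balance lower bound, which is the effect the abstract's "within 2% of optimal" claim describes.

```python
import heapq

def dynamic_schedule(task_costs, worker_speeds):
    """Greedy self-scheduling: each task is taken by the worker that
    becomes free earliest. Returns the makespan (last finish time)."""
    # Min-heap of (finish_time, worker_id); all workers start idle at t=0.
    heap = [(0.0, w) for w in range(len(worker_speeds))]
    heapq.heapify(heap)
    for cost in task_costs:
        t, w = heapq.heappop(heap)
        # A faster worker finishes the same task sooner.
        heapq.heappush(heap, (t + cost / worker_speeds[w], w))
    return max(t for t, _ in heap)

# Hypothetical heterogeneous node: two fast processors, two slower coprocessors.
speeds = [1.0, 1.0, 0.6, 0.6]
# Irregular task costs (illustrative only).
tasks = [1.0 + (i % 7) * 0.3 for i in range(10_000)]

makespan = dynamic_schedule(tasks, speeds)
# Perfect-balance lower bound: total work divided by aggregate speed.
lower_bound = sum(tasks) / sum(speeds)
imbalance = makespan / lower_bound - 1.0
print(f"load imbalance: {imbalance:.3%}")
```

Because every task costs far less than the total runtime, the greedy rule keeps all workers busy until near the end, so the imbalance is a small fraction of the makespan. A static partition computed from inaccurate speed estimates would not adapt this way, which is why dynamic scheduling suits heterogeneous nodes.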
X. Liu—During this research, Xing Liu was affiliated with Georgia Institute of Technology.
Notes
1. Intel, Xeon, and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries.
2. We refer to Dtree tasks as work items in this application.
3. We refer to Dtree tasks as work items, to prevent confusion with GTFock tasks.
4. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.
Acknowledgements
The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this paper. URL: http://www.tacc.utexas.edu.
Copyright information
© 2015 Springer International Publishing Switzerland
Cite this paper
Pamnany, K., Misra, S., Md., V., Liu, X., Chow, E., Aluru, S. (2015). Dtree: Dynamic Task Scheduling at Petascale. In: Kunkel, J., Ludwig, T. (eds) High Performance Computing. ISC High Performance 2015. Lecture Notes in Computer Science(), vol 9137. Springer, Cham. https://doi.org/10.1007/978-3-319-20119-1_10
Print ISBN: 978-3-319-20118-4
Online ISBN: 978-3-319-20119-1