Dtree: Dynamic Task Scheduling at Petascale

  • Kiran Pamnany
  • Sanchit Misra
  • Vasimuddin Md.
  • Xing Liu
  • Edmond Chow
  • Srinivas Aluru
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9137)


Irregular applications are challenging to scale on supercomputers due to the difficulty of balancing load across large numbers of nodes. This challenge is exacerbated by the increasing heterogeneity of modern supercomputers, in which nodes often contain multiple processors and coprocessors operating at different speeds and with differing core and thread counts. We present Dtree, a dynamic task scheduler designed to address this challenge. Dtree shows close to optimal results for a class of HPC applications, improving time-to-solution by achieving near-perfect load balance while consuming negligible resources. We demonstrate Dtree’s effectiveness on up to 77,824 heterogeneous cores of the TACC Stampede supercomputer with two different petascale HPC applications: ParaBLe, which performs large-scale Bayesian network structure learning, and GTFock, which implements Fock matrix construction, an essential and expensive step in quantum chemistry codes. For ParaBLe, we show improved performance while eliminating the complexity of managing heterogeneity. For GTFock, we match the most recently published performance without using any application-specific optimizations for data access patterns (such as the task distribution design for communication reduction) that enabled that performance. We also show that Dtree can distribute from tens of thousands to hundreds of millions of irregular tasks across up to 1024 nodes with minimal overhead, while balancing load to within 2% of optimal.
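
The load-balance claim above can be made concrete with a small simulation. The sketch below is not Dtree's algorithm, which the abstract does not describe. It is a generic, centralized self-scheduling model in Python. All function names, worker speeds, and task costs are hypothetical and chosen only for illustration. It compares a static round-robin assignment with dynamic assignment (each task goes to whichever worker becomes free first) on workers of differing speed, and measures both against the ideal makespan of total work divided by total speed.

```python
import heapq
import random


def static_makespan(costs, speeds):
    """Round-robin pre-assignment: task i always goes to worker i % n."""
    finish = [0.0] * len(speeds)
    for i, cost in enumerate(costs):
        w = i % len(speeds)
        finish[w] += cost / speeds[w]
    return max(finish)


def dynamic_makespan(costs, speeds):
    """Self-scheduling: each task goes to the worker that is free earliest."""
    # Heap of (time at which the worker becomes free, worker index).
    heap = [(0.0, w) for w in range(len(speeds))]
    heapq.heapify(heap)
    makespan = 0.0
    for cost in costs:
        free_at, w = heapq.heappop(heap)
        free_at += cost / speeds[w]
        makespan = max(makespan, free_at)
        heapq.heappush(heap, (free_at, w))
    return makespan


def ideal_makespan(costs, speeds):
    """Lower bound: all workers busy for the whole run, no idle time."""
    return sum(costs) / sum(speeds)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical heterogeneous machine: 8 slow workers and 8 fast ones.
    speeds = [1.0] * 8 + [2.5] * 8
    # Irregular tasks: costs spread over two orders of magnitude.
    costs = [rng.uniform(0.1, 10.0) for _ in range(20000)]

    ideal = ideal_makespan(costs, speeds)
    for name, fn in (("static", static_makespan), ("dynamic", dynamic_makespan)):
        overhead = fn(costs, speeds) / ideal - 1.0
        print(f"{name:8s} makespan is {overhead:.2%} above the ideal")
```

In this model, "within 2% of optimal" corresponds to the dynamic makespan exceeding the ideal bound by at most 0.02. The model ignores scheduling and communication overhead, which a real distributed scheduler has to keep small for such a bound to hold at scale.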


Keywords: Petascale · Dynamic scheduling · Load balance



The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this paper. URL:



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Kiran Pamnany (1), corresponding author
  • Sanchit Misra (1)
  • Vasimuddin Md. (2)
  • Xing Liu (3)
  • Edmond Chow (4)
  • Srinivas Aluru (4)

  1. Parallel Computing Lab, Intel Corporation, Bangalore, India
  2. Department of Computer Science and Engineering, Indian Institute of Technology Bombay, Mumbai, India
  3. IBM T.J. Watson Research Center, Yorktown Heights, USA
  4. School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, USA
