Dtree: Dynamic Task Scheduling at Petascale

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9137)

Abstract

Irregular applications are challenging to scale on supercomputers due to the difficulty of balancing load across large numbers of nodes. This challenge is exacerbated by the increasing heterogeneity of modern supercomputers in which nodes often contain multiple processors and coprocessors operating at different speeds, and with differing core and thread counts. We present Dtree, a dynamic task scheduler designed to address this challenge. Dtree shows close to optimal results for a class of HPC applications, improving time-to-solution by achieving near-perfect load balance while consuming negligible resources. We demonstrate Dtree’s effectiveness on up to 77,824 heterogeneous cores of the TACC Stampede supercomputer with two different petascale HPC applications: ParaBLe, which performs large-scale Bayesian network structure learning, and GTFock, which implements Fock matrix construction, an essential and expensive step in quantum chemistry codes. For ParaBLe, we show improved performance while eliminating the complexity of managing heterogeneity. For GTFock, we match the most recently published performance without using any application-specific optimizations for data access patterns (such as the task distribution design for communication reduction) that enabled that performance. We also show that Dtree can distribute from tens of thousands to hundreds of millions of irregular tasks across up to 1024 nodes with minimal overhead, while balancing load to within 2% of optimal.
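To illustrate why on-demand task assignment balances load under heterogeneity where static partitioning does not, the sketch below compares the two policies in a simple simulation. This is a minimal illustrative model, not Dtree's actual distributed algorithm; all function names, the task-cost pattern, and the worker speeds are hypothetical. Dynamic assignment here is greedy list scheduling: each task goes to whichever worker becomes free earliest, so faster workers naturally absorb more work.

```python
# Hypothetical illustration of dynamic vs. static task scheduling on
# heterogeneous workers. NOT the Dtree algorithm; a toy model only.
import heapq

def dynamic_schedule(task_costs, worker_speeds):
    """Greedy on-demand assignment (list scheduling): each task is given
    to the worker that becomes free earliest. Returns per-worker finish
    times; the makespan is their maximum."""
    free = [(0.0, i) for i in range(len(worker_speeds))]  # (free-at, worker)
    heapq.heapify(free)
    finish = [0.0] * len(worker_speeds)
    for cost in task_costs:
        t, i = heapq.heappop(free)
        t += cost / worker_speeds[i]        # execution time on worker i
        finish[i] = t
        heapq.heappush(free, (t, i))
    return finish

def static_schedule(task_costs, worker_speeds):
    """Static blocked partition: contiguous equal-count chunks per worker,
    ignoring both task irregularity and worker heterogeneity."""
    n = len(worker_speeds)
    chunk = (len(task_costs) + n - 1) // n
    return [sum(task_costs[i * chunk:(i + 1) * chunk]) / worker_speeds[i]
            for i in range(n)]

if __name__ == "__main__":
    # Irregular task costs, heterogeneous worker speeds (hypothetical values).
    costs = [(i % 7) + 1 for i in range(1000)]
    speeds = [1.0, 1.0, 2.0, 4.0]
    dyn = max(dynamic_schedule(costs, speeds))
    stat = max(static_schedule(costs, speeds))
    lower_bound = sum(costs) / sum(speeds)  # perfect-balance makespan
    print(f"dynamic makespan: {dyn:.1f}  static: {stat:.1f}  bound: {lower_bound:.1f}")
```

With these numbers the dynamic makespan lands within a few percent of the perfect-balance lower bound, while the static partition leaves the fast workers idle and finishes roughly twice as late; this is the gap a dynamic scheduler like Dtree closes, with the additional (and harder) requirement of doing so across distributed-memory nodes at low overhead.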

X. Liu—During this research, Xing Liu was affiliated with Georgia Institute of Technology.


Notes

  1. Intel, Xeon, and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries.

  2. We refer to Dtree tasks as work items in this application.

  3. We refer to Dtree tasks as work items, to prevent confusion with GTFock tasks.

  4. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.


Acknowledgements

The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this paper. URL: http://www.tacc.utexas.edu.

Author information


Correspondence to Kiran Pamnany.


Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Pamnany, K., Misra, S., Md., V., Liu, X., Chow, E., Aluru, S. (2015). Dtree: Dynamic Task Scheduling at Petascale. In: Kunkel, J., Ludwig, T. (eds) High Performance Computing. ISC High Performance 2015. Lecture Notes in Computer Science(), vol 9137. Springer, Cham. https://doi.org/10.1007/978-3-319-20119-1_10

  • DOI: https://doi.org/10.1007/978-3-319-20119-1_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-20118-4

  • Online ISBN: 978-3-319-20119-1

  • eBook Packages: Computer Science (R0)
