Asynchronous Nested Parallelism for Dynamic Applications in Distributed Memory

  • Ioannis Papadopoulos
  • Nathan Thomas
  • Adam Fidel
  • Dielli Hoxha
  • Nancy M. Amato
  • Lawrence Rauchwerger
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9519)


Nested parallelism is of increasing interest for both expressivity and performance. Many problems are naturally expressed with this divide-and-conquer software design approach. In addition, programmers with target architecture knowledge employ nested parallelism for performance, imposing a hierarchy in the application to increase locality and resource utilization, often at the cost of implementation complexity.

While dynamic applications are a natural fit for the approach, support for nested parallelism in distributed systems is generally limited to well-structured applications engineered with distinct phases of intra-node computation and inter-node communication. This model makes irregular applications difficult to express and also hurts performance by introducing unnecessary latency and synchronization. In this paper we describe an approach to asynchronous nested parallelism that provides uniform treatment of nested computation across distributed memory. This approach allows efficient execution while supporting dynamic applications that cannot be mapped onto the machine in the rigid manner of regular applications. We use several graph algorithms as examples to demonstrate our library’s expressivity, flexibility, and performance.


Keywords: Nested parallelism · Asynchronous · Isolation · Graph · Dynamic applications



This research is supported in part by NSF awards CNS-0551685, CCF-0702765, CCF-0833199, CCF-1439145, CCF-1423111, CCF-0830753, IIS-0916053, IIS-0917266, EFRI–1240483, RI-1217991, by NIH NCI R25 CA090301-11, by DOE awards DE-AC02-06CH11357, DE-NA0002376, B575363, by Samsung, IBM, Intel, and by Award KUS-C1-016-04, made by King Abdullah University of Science and Technology (KAUST). This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Ioannis Papadopoulos (1)
  • Nathan Thomas (1)
  • Adam Fidel (1)
  • Dielli Hoxha (1)
  • Nancy M. Amato (1)
  • Lawrence Rauchwerger (1)

  1. Parasol Laboratory, Department of Computer Science and Engineering, Texas A&M University, College Station, USA
