Profiling Non-numeric OpenSHMEM Applications with the TAU Performance System

  • John Linford
  • Tyler A. Simon
  • Sameer Shende
  • Allen D. Malony
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8356)

Abstract

The recent development of a unified SHMEM framework, OpenSHMEM, has enabled further study of porting and scaling applications that can benefit from the SHMEM programming model. This paper focuses on non-numerical graph algorithms, which typically have a low FLOPS/byte ratio. It presents an overview of the space and time complexity of Kruskal’s and Prim’s algorithms for generating a minimum spanning tree (MST), along with an implementation of Kruskal’s algorithm that uses OpenSHMEM to generate the MST in parallel without intermediate communication. It also presents a procedure for applying the TAU Performance System to OpenSHMEM applications to produce in-depth performance profiles showing time spent in code regions, memory access patterns, and network load. Performance evaluations are provided from the Cray XK7 “Titan” system at Oak Ridge National Laboratory and from a 48-core shared-memory system at the University of Maryland, Baltimore County.
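For readers unfamiliar with Kruskal’s algorithm, the following is a minimal serial sketch in Python. It is not the paper’s OpenSHMEM implementation. The names `DisjointSet`, `kruskal`, and `partitioned_kruskal` are hypothetical and chosen for illustration. The second function shows one plausible way to build an MST from independently processed edge partitions with no communication until a final merge. That partitioning scheme is an assumption, since the abstract does not describe how the authors avoid intermediate communication.

```python
from typing import List, Tuple

Edge = Tuple[float, int, int]  # (weight, u, v)


class DisjointSet:
    """Union-find with path halving and union by rank."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x: int) -> int:
        # Path halving: point each visited node at its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> bool:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False  # same component: the edge would close a cycle
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True


def kruskal(num_vertices: int, edges: List[Edge]) -> List[Edge]:
    """Return a minimum spanning forest.

    Sorting dominates the cost: O(E log E) time, O(V + E) space.
    """
    components = DisjointSet(num_vertices)
    chosen: List[Edge] = []
    for edge in sorted(edges):  # ascending by weight
        _, u, v = edge
        if components.union(u, v):
            chosen.append(edge)
    return chosen


def partitioned_kruskal(
    num_vertices: int, edges: List[Edge], num_parts: int
) -> List[Edge]:
    """Assumed partition-then-merge variant (not taken from the paper).

    Each partition is reduced to its own spanning forest independently,
    which is the step that could run on separate processing elements.
    A final Kruskal pass over the surviving edges yields the MST.
    """
    survivors: List[Edge] = []
    for part in range(num_parts):
        survivors.extend(kruskal(num_vertices, edges[part::num_parts]))
    return kruskal(num_vertices, survivors)
```

As a usage example, for the four-vertex graph `[(1, 0, 1), (4, 0, 2), (3, 1, 2), (2, 1, 3), (5, 2, 3)]`, both `kruskal(4, edges)` and `partitioned_kruskal(4, edges, 2)` return a three-edge tree of total weight 6. The partitioned variant is correct because any edge rejected inside a partition is the heaviest edge on some cycle, so an MST exists that does not use it.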

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • John Linford (2)
  • Tyler A. Simon (1, 2)
  • Sameer Shende (2, 3)
  • Allen D. Malony (2, 3)
  1. University of Maryland, Baltimore County, USA
  2. ParaTools Inc., USA
  3. University of Oregon, USA
