PerWiz: A What-If Prediction Tool for Tuning Message Passing Programs

  • Fumihiko Ino
  • Yuki Kanbe
  • Masao Okita
  • Kenichi Hagihara
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3402)


This paper presents PerWiz, a performance prediction tool for improving the performance of message passing programs. PerWiz focuses on locating where a significant improvement can be achieved. To do this, it performs a post-mortem analysis based on a realistic parallel computational model, LogGPS, predicting what performance would be achieved if the program were modified according to typical tuning techniques, such as load balancing for a better workload distribution and message scheduling for a shorter waiting time. We also present two case studies in which PerWiz played an important role in improving the performance of regular applications. Our results indicate that PerWiz helps application developers assess the potential reduction in execution time that a program modification would deliver.
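To give a flavor of the what-if analysis the abstract describes, the sketch below replays a toy message-passing trace under a simplified LogP-style cost model and predicts the makespan of a hypothetical load-balanced version. This is an illustrative assumption-laden sketch, not PerWiz's actual implementation: the trace format, the `Phase` record, and the parameter values are invented here, and the real LogGPS model additionally accounts for per-byte gap and synchronous message semantics.

```python
# Toy "what-if" predictor: replay a post-mortem trace under a simplified
# LogP-style model (latency L, per-message overhead o). All names, values,
# and the trace format are hypothetical illustrations, not PerWiz's API.
from dataclasses import dataclass
from typing import Optional

L = 5.0   # network latency (us), assumed
o = 1.0   # per-message send/receive overhead (us), assumed

@dataclass
class Phase:
    compute: float            # local computation time before communicating
    send_to: Optional[int]    # destination rank of this phase's message, if any

def predict_makespan(trace: dict) -> float:
    """Replay the trace step by step: each process computes, then a message
    arrives at its receiver no earlier than sender_ready + o + L + o."""
    clock = {rank: 0.0 for rank in trace}
    arrivals = {rank: [] for rank in trace}
    steps = max(len(phases) for phases in trace.values())
    for step in range(steps):
        # computation phase
        for rank, phases in trace.items():
            if step < len(phases):
                clock[rank] += phases[step].compute
        # communication phase (message order is known post-mortem)
        for rank, phases in trace.items():
            if step < len(phases) and phases[step].send_to is not None:
                arrivals[phases[step].send_to].append(clock[rank] + o + L + o)
        # receivers wait for the latest incoming message
        for rank in trace:
            if arrivals[rank]:
                clock[rank] = max(clock[rank], max(arrivals[rank]))
                arrivals[rank].clear()
    return max(clock.values())

def what_if_balanced(trace: dict) -> float:
    """What-if scenario: redistribute each step's computation evenly across
    the processes active in that step, then re-predict the makespan."""
    balanced = {r: [Phase(p.compute, p.send_to) for p in ps]
                for r, ps in trace.items()}
    steps = max(len(ps) for ps in trace.values())
    for step in range(steps):
        work = [ps[step].compute for ps in trace.values() if step < len(ps)]
        avg = sum(work) / len(work)
        for ps in balanced.values():
            if step < len(ps):
                ps[step].compute = avg
    return predict_makespan(balanced)
```

For example, with rank 0 computing for 10 us before sending to rank 1 (which computes for only 2 us), `predict_makespan` reports 17.0 us, while `what_if_balanced` predicts 13.0 us: the gap is the waiting time a load-balancing modification could recover, which is exactly the kind of quantitative answer the paper's what-if analysis aims at.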

References

  1. Message Passing Interface Forum: MPI: A message-passing interface standard. Int’l J. Supercomputer Applications and High Performance Computing 8, 159–416 (1994)
  2. Heath, M.T., Etheridge, J.A.: Visualizing the performance of parallel programs. IEEE Software 8, 29–39 (1991)
  3. Nagel, W.E., Arnold, A., Weber, M., Hoppe, H.C., Solchenbach, K.: VAMPIR: Visualization and analysis of MPI resources. The J. Supercomputing 12, 69–80 (1996)
  4. Zaki, O., Lusk, E., Gropp, W., Swider, D.: Toward scalable performance visualization with Jumpshot. Int’l J. High Performance Computing Applications 13, 277–288 (1999)
  5. Rose, L.A.D., Reed, D.A.: SvPablo: A multi-language architecture-independent performance analysis system. In: Proc. 28th Int’l Conf. Parallel Processing (ICPP 1999), pp. 311–318 (1999)
  6. Yan, J., Sarukkai, S., Mehra, P.: Performance measurement, visualization and modeling of parallel and distributed programs using the AIMS toolkit. Software: Practice and Experience 25, 429–461 (1995)
  7. Miller, B.P., Callaghan, M.D., Cargille, J.M., Hollingsworth, J.K., Irvin, R.B., Karavanic, K.L., Kunchithapadam, K., Newhall, T.: The Paradyn parallel performance measurement tool. IEEE Computer 28, 37–46 (1995)
  8. Fahringer, T., Seragiotto, C.: Automatic search for performance problems in parallel and distributed programs by using multi-experiment analysis. In: Sahni, S.K., Prasanna, V.K., Shukla, U. (eds.) HiPC 2002. LNCS, vol. 2552, pp. 151–162. Springer, Heidelberg (2002)
  9. Cain, H.W., Miller, B.P., Wylie, B.J.N.: A callgraph-based search strategy for automated performance diagnosis. Concurrency and Computation: Practice and Experience 14, 203–217 (2002)
  10. Roth, P.C., Miller, B.P.: Deep Start: a hybrid strategy for automated performance problem searches. Concurrency and Computation: Practice and Experience 15, 1027–1046 (2003)
  11. Block, R.J., Sarukkai, S., Mehra, P.: Automated performance prediction of message-passing parallel programs. In: Proc. High Performance Networking and Computing Conf. (SC 1995) (1995)
  12. Hollingsworth, J.K.: Critical path profiling of message passing and shared-memory programs. IEEE Trans. Parallel and Distributed Systems 9, 1029–1040 (1998)
  13. Eom, H., Hollingsworth, J.K.: A tool to help tune where computation is performed. IEEE Trans. Software Engineering 27, 618–629 (2001)
  14. Ino, F., Fujimoto, N., Hagihara, K.: LogGPS: A parallel computational model for synchronization analysis. In: Proc. 8th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP 2001), pp. 133–142 (2001)
  15. Lamport, L.: Time, clocks, and the ordering of events in a distributed system. Communications of the ACM 21, 558–565 (1978)
  16. Culler, D., Karp, R., Patterson, D., Sahay, A., Schauser, K.E., Santos, E., Subramonian, R., von Eicken, T.: LogP: Towards a realistic model of parallel computation. In: Proc. 4th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP 1993), pp. 1–12 (1993)
  17. Alexandrov, A., Ionescu, M., Schauser, K., Scheiman, C.: LogGP: Incorporating long messages into the LogP model for parallel computation. J. Parallel and Distributed Computing 44, 71–79 (1997)
  18. Herrarte, V., Lusk, E.: Studying parallel program behavior with upshot. Technical Report ANL–91/15, Argonne National Laboratory (1991)
  19. Gropp, W., Lusk, E., Doss, N., Skjellum, A.: A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing 22, 789–828 (1996)
  20. Boden, N.J., Cohen, D., Felderman, R.E., Kulawik, A.E., Seitz, C.L., Seizovic, J.N., Su, W.K.: Myrinet: A gigabit-per-second local-area network. IEEE Micro 15, 29–36 (1995)
  21. O’Carroll, F., Tezuka, H., Hori, A., Ishikawa, Y.: The design and implementation of zero copy MPI using commodity hardware with a high performance network. In: Proc. 12th ACM Int’l Conf. Supercomputing (ICS 1998), pp. 243–250 (1998)
  22. Schnabel, J.A., Rueckert, D., Quist, M., Blackall, J.M., Castellano-Smith, A.D., Hartkens, T., Penney, G.P., Hall, W.A., Liu, H., Truwit, C.L., Gerritsen, F.A., Hill, D.L.G., Hawkes, D.J.: A generic framework for non-rigid registration based on non-uniform multi-level free-form deformations. In: Niessen, W.J., Viergever, M.A. (eds.) MICCAI 2001. LNCS, vol. 2208, pp. 573–581. Springer, Heidelberg (2001)
  23. Graham, S.L., Kessler, P.B., McKusick, M.K.: gprof: a call graph execution profiler. In: Proc. SIGPLAN Symp. Compiler Construction (SCC 1982), pp. 120–126 (1982)
  24. Takeuchi, A., Ino, F., Hagihara, K.: An improved binary-swap compositing for sort-last parallel rendering on distributed memory multiprocessors. Parallel Computing 29, 1745–1762 (2003)
  25. Truong, H.L., Fahringer, T.: SCALEA: a performance analysis tool for parallel programs. Concurrency and Computation: Practice and Experience 15, 1001–1025 (2003)
  26. Taylor, V., Wu, X., Stevens, R.: Prophesy: An infrastructure for performance analysis and modeling of parallel and grid applications. ACM SIGMETRICS Performance Evaluation Review 30, 13–18 (2003)
  27. Geisler, J., Taylor, V.: Performance coupling: Case studies for improving the performance of scientific applications. J. Parallel and Distributed Computing 62, 1227–1247 (2002)
  28. Brunst, H., Malony, A.D., Shende, S.S., Bell, R.: Online remote trace analysis of parallel applications on high-performance clusters. In: Veidenbaum, A., Joe, K., Amano, H., Aiso, H. (eds.) ISHPC 2003. LNCS, vol. 2858, pp. 440–449. Springer, Heidelberg (2003)
  29. Labarta, J., Girona, S., Cortes, T.: Analyzing scheduling policies using Dimemas. Parallel Computing 23, 23–34 (1997)

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Fumihiko Ino (1)
  • Yuki Kanbe (1)
  • Masao Okita (1)
  • Kenichi Hagihara (1)

  1. Graduate School of Information Science and Technology, Osaka University, Osaka, Japan