Cluster Computing

, Volume 20, Issue 2, pp 1179–1192 | Cite as

Predicting HPC parallel program performance based on LLVM compiler

Article
  • 278 Downloads

Abstract

Performance prediction of parallel program plays key roles in many areas, such as parallel system design, parallel program optimization, and parallel system procurement. Accurate and efficient performance prediction on large-scale parallel systems is a challenging problem. To solve this problem, we present an effective framework for performance prediction based on the LLVM compiler technique in this paper. We can predict the performance of a parallel program on a small amount of nodes of the target parallel system using this framework toned but not execute this parallel program on a corresponding full-scale parallel system. This framework predicts the performance of computation and communication components separately and combines the two predictions to achieve full program prediction. As for sequential computation, we first combine the static branch probability and loop trip count identification and propose a new instrumentation method to acquire the number of each instruction type. We then construct a test program to measure the average execution time of each instruction type. Finally, we utilize the pruning technique to convert a parallel program into a corresponding sequential program to predict the performance on only one node of the target parallel system. As for communication, we utilize the LogGP model to model point-to-point communication and the artificial neural network technique to model collective communication. We validate our approach by a set of experiments that predict the performance of NAS parallel benchmarks and CGPOP parallel application. Experimental results show that the proposed framework can accurately predict the execution time of parallel programs, and the average error rate of these programs is 10.86%.

Keywords

Parallel program Performance prediction Loop trip count Artificial neural network LLVM 

References

  1. 1.
    Top 500 supercomputer site. http://www.top500.org
  2. 2.
    Kerbyson, D.J., Alme, H.J., Hoisie, A., Petrini, F., Wasserman, H.J., Gittings, M.: Predictive performance and scalability modeling of a large-scale application. In: Proceedings of the 2001 ACM/IEEE Conference on Supercomputing. ACM (2001)Google Scholar
  3. 3.
    Sharapov, I., Kroeger, R., Delamarter, G., Cheveresan, R., Ramsay, M.: A case study in top-down performance estimation for a large-scale parallel application. In: Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 81–89. ACM (2006)Google Scholar
  4. 4.
    Snavely, A., Carrington, L., Wolter, N., Labarta, J., Badia, R., Purkayastha, A.: A framework for performance modeling and prediction. In: Supercomputing, ACM/IEEE 2002 Conference. IEEE (2002)Google Scholar
  5. 5.
    Zheng, G., Kakulapati, G., Kalé, L.V.: Bigsim: a parallel simulator for performance prediction of extremely large parallel machines. In: Proceedings of the 18th International Parallel and Distributed Processing Symposium. IEEE (2004)Google Scholar
  6. 6.
    Zhang, W., Cheng, A.M., Subhlok, J.: DwarfCode: a performance prediction tool for parallel applications. IEEE Trans. Comput. 65(2), 495–507 (2016)MathSciNetCrossRefGoogle Scholar
  7. 7.
    Message passing interface forum. http://www.mpiforum.org/
  8. 8.
    Lattner, C., Adve, V.: LLVM: a compilation framework for lifelong program analysis & transformation. In: International Symposium on Code Generation and Optimization, pp. 75–86. CGO IEEE (2004)Google Scholar
  9. 9.
    Alexandrov, A., Ionescu, M.F., Schauser, K.E., Scheiman, C.: LogGP: incorporating long messages into the LogP model—one step closer towards a realistic model for parallel computation. In: Proceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 95–105. ACM (1995)Google Scholar
  10. 10.
    Hoefler, T., Mehlan, T., Lumsdaine, A., Rehm, W.: Netgauge: a network performance measurement framework. In: International Conference on High Performance Computing and Communications, pp. 659–671. Springer, Berlin (2007)Google Scholar
  11. 11.
  12. 12.
    Wu, Y., Larus, J.R.: Static branch frequency and program profile analysis. In: Proceedings of the 27th Annual International Symposium on Microarchitecture, pp. 1–11. ACM (1994)Google Scholar
  13. 13.
    Aho, A.V., Sethi, R., Ullman, J.D.: Compilers, Principles, Techniques, pp. 670–671. Addison wesley, Boston (1986)Google Scholar
  14. 14.
    Hoefler, T., Lichei, A., Rehm, W.: Low-overhead LogGP parameter assessment for modern interconnection networks. In: 2007 IEEE International Parallel and Distributed Processing Symposium, pp. 1–8. IEEE (2007)Google Scholar
  15. 15.
    Pješivac-Grbović, J., Angskun, T., Bosilca, G., Fagg, G.E., Gabriel, E., Dongarra, J.J.: Performance analysis of MPI collective operations. Clust. Comput. 10(2), 127–143 (2007)CrossRefGoogle Scholar
  16. 16.
    OpenMPI project. https://www.open-mpi.org/
  17. 17.
    Bailey, D.H., Barszcz, E., Barton, J.T., Browning, D.S., Carter, R.L., Dagum, L., Simon, H.D.: The NAS parallel benchmarks—summary and preliminary results. In: Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, pp. 158–165. ACM (1991)Google Scholar
  18. 18.
    Stone, A., Dennis, J.M., & Strout, M.M.: The CGPOP miniapp, version 1.0. Colorado State University, Technical Report CS-11-103 (2011)Google Scholar
  19. 19.
    Velho, P., Legrand, A.: Accuracy study and improvement of network simulation in the simgrid framework. In: Proceedings of the second International Conference on Simulation Tools and Techniques. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering) (2009)Google Scholar
  20. 20.
    Malyshkin, V.E.: Peculiarities of numerical algorithms parallel implementation for exa-flops multicomputers. Int. J. Big Data Intell. 1(1–2), 65–73 (2014)CrossRefGoogle Scholar
  21. 21.
    Viswanathan, V.: Discovery of semantic associations in an RDF graph using bi-directional BFS on massively parallel hardware. Int. J. Big Data Intell. 3(3), 176–181 (2016)MathSciNetCrossRefGoogle Scholar
  22. 22.
    Wu, C.C., Ke, J.Y., Lin, H., Jhan, S.S.: Adjusting thread parallelism dynamically to accelerate dynamic programming with irregular workload distribution on GPGPUs. Int. J. Grid High Perform. Comput. (IJGHPC) 6(1), 1–20 (2014)CrossRefGoogle Scholar
  23. 23.
    Barker, K.J., Pakin, S., Kerbyson, D.J.: A performance model of the krak hydrodynamics application. In: 2006 International Conference on Parallel Processing (ICPP’06), pp. 245–254. IEEE (2006)Google Scholar
  24. 24.
    Hoisie, A., Lubeck, O., Wasserman, H.: Performance and scalability analysis of teraflop-scale parallel architectures using multidimensional wavefront applications. Int. J. High Perform. Comput. Appl. 14(4), 330–346 (2000)CrossRefGoogle Scholar
  25. 25.
    Calotoiu, A., Hoefler, T., Poke, M., Wolf, F.: Using automated performance modeling to find scalability bugs in complex codes. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM (2013)Google Scholar
  26. 26.
    Cascaval, C., DeRose, L., Padua, D.A., Reed, D.A.: Compile-time based performance prediction. In: International Workshop on Languages and Compilers for Parallel Computing, pp. 365–379. Springer, Berlin (1999)Google Scholar
  27. 27.
    Zhai, J., Chen, W., Zheng, W.: Phantom: predicting performance of parallel applications on large-scale parallel machines using a single node. In: ACM Sigplan Notices, vol. 45, no. 5, pp. 305–314. ACM (2010)Google Scholar

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. 1.School of Computer Science and TechnologyHarbin Institute of TechnologyHarbinChina
  2. 2.Department of Computer ScienceUniversity of Illinois at Urbana-ChampaignChampaignUSA

Personalised recommendations