From Serial Loops to Parallel Execution on Distributed Systems

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7484)


Programmability and performance portability are two major challenges in today's dynamic hardware environment. Designers of efficient algorithms should focus on high-level algorithms that expose maximum parallelism, while relying on compilers and run-time systems to discover and exploit that parallelism and deliver sustainable performance on a variety of hardware. The compiler tool presented in this paper analyzes the data flow of serial codes containing imperfectly nested, affine loop nests and if statements, as commonly found in scientific applications. The tool serves as the front-end compiler for the DAGuE run-time system, automatically converting serial codes into a symbolic representation of their data flow. We show how the compiler performs this analysis, and demonstrate that scientifically important dense linear algebra operations benefit from it, delivering high performance on large-scale platforms.


Keywords: compiler analysis · symbolic data flow · distributed computing · task scheduling



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  1. University of Tennessee, Knoxville, USA
  2. University of Manchester, Manchester, UK
