Automatic program parallelization with a block data distribution

Abstract

This paper discusses several automated methods for accelerating program execution. The acceleration is achieved by parallelization and by optimization of memory access. Accesses to RAM are optimized by switching to block code and to a block distribution of arrays. For distributed memory, automated array distributions and array distributions with overlapping are used. The automation is implemented in the C language with pragmas in the Optimizing Parallelizing System. Numerical results for problems of linear algebra and mathematical physics are presented. Some features of this demonstration converter are accessible remotely over the Internet.
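The "block code" mentioned above corresponds to loop tiling: loop nests are restructured so that each tile of the operands fits in cache and is reused before it is evicted. The following sketch illustrates the idea for matrix multiplication in plain C; the matrix order N, the block size BS, and the function name are illustrative assumptions, and the OPS pragma notation described in the paper is not reproduced here.

    /* A minimal sketch of tiled ("block") matrix multiplication.
       N and BS are illustrative; N is assumed to be a multiple of BS. */
    #include <stddef.h>

    #define N  1024   /* matrix order (assumed) */
    #define BS 64     /* block (tile) size chosen to fit in cache (assumed) */

    /* C = C + A * B, computed tile by tile so that each BS x BS tile
       of the operands stays in cache while it is being reused. */
    void matmul_blocked(double A[N][N], double B[N][N], double C[N][N])
    {
        for (size_t ii = 0; ii < N; ii += BS)
            for (size_t kk = 0; kk < N; kk += BS)
                for (size_t jj = 0; jj < N; jj += BS)
                    /* multiply the (ii,kk) tile of A by the (kk,jj) tile of B */
                    for (size_t i = ii; i < ii + BS; ++i)
                        for (size_t k = kk; k < kk + BS; ++k) {
                            double a = A[i][k];
                            for (size_t j = jj; j < jj + BS; ++j)
                                C[i][j] += a * B[k][j];
                        }
    }

A block distribution of arrays goes one step further: the array elements themselves are stored tile by tile, so that each tile occupies a contiguous region of memory (or resides on a single node of a distributed-memory system), which is what makes the tiled traversal above cache- and communication-friendly.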

Author information

Corresponding author

Correspondence to L. R. Gervich.

Additional information

Original Russian Text © L.R. Gervich, E.N. Kravchenko, B.Ya. Steinberg, M.V. Yurushkin, 2015, published in Sibirskii Zhurnal Vychislitel’noi Matematiki, 2015, Vol. 18, No. 1, pp. 41–53.

About this article

Cite this article

Gervich, L.R., Kravchenko, E.N., Steinberg, B.Y. et al. Automatic program parallelization with a block data distribution. Numer. Analys. Appl. 8, 35–45 (2015). https://doi.org/10.1134/S1995423915010048

Keywords

  • automatic parallelization
  • tiling
  • block distribution of arrays
  • optimization of memory access
  • distribution with overlapping