Automatic Parallelization of Python Programs for Distributed Heterogeneous Computing

Part of the Lecture Notes in Computer Science book series (LNCS,volume 13440)


This paper introduces a new approach to automatic ahead-of-time (AOT) parallelization and optimization of sequential Python programs for execution on distributed heterogeneous platforms. Our approach enables AOT source-to-source transformation of Python programs, driven by the inclusion of type hints for function parameters and return values. These hints can be supplied by the programmer or obtained from dynamic profiling tools; multi-version code generation guarantees the correctness of our AOT transformation in all cases.
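The multi-version idea described above can be sketched as a type-guarded dispatcher that runs the optimized variant only when the declared hints actually hold at run time, and otherwise falls back to the original function. This is an illustrative sketch, not the paper's generated code; the function names are hypothetical:

```python
import numpy as np

def kernel_seq(a, b):
    # Original, unmodified user function: elementwise multiply-add.
    out = []
    for x, y in zip(a, b):
        out.append(x * y + 1.0)
    return out

def kernel_opt(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # AOT-optimized variant, valid only when the type hints hold.
    return a * b + 1.0

def kernel(a, b):
    # Multi-version dispatch: the run-time guard preserves correctness
    # even when the profiled/declared types turn out to be wrong.
    if isinstance(a, np.ndarray) and isinstance(b, np.ndarray):
        return kernel_opt(a, b)
    return kernel_seq(a, b)
```

The guard costs one `isinstance` check per call; inputs that violate the hints simply take the unoptimized path, so the transformation never changes program semantics.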

Our compilation framework performs automatic parallelization and sophisticated high-level code optimizations for the target distributed heterogeneous hardware platform. It introduces novel extensions to the polyhedral compilation framework that unify user-written loops and the implicit loops present in matrix/tensor operators, as well as automated selection of CPU vs. GPU code variants. Finally, the parallelized code generated by our approach is deployed using the Ray runtime to schedule distributed tasks across multiple heterogeneous nodes in a cluster, thereby enabling both intra-node and inter-node parallelism.
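The unification of explicit and implicit loops can be illustrated with a small NumPy example (a sketch of the concept, not the paper's actual transformation): a user-written loop around a `@` matrix product hides a `j`/`k` loop nest, and writing every implicit loop out explicitly exposes the single iteration space that a polyhedral framework could analyze and transform, for example by fusing a following elementwise operation into the same nest.

```python
import numpy as np

def user_code(A, B, C):
    # Mixed form: an explicit loop around an implicit-loop operator.
    for i in range(A.shape[0]):
        C[i, :] = A[i, :] @ B    # '@' hides a j/k loop nest
    return C + 1.0               # another implicit i/j loop

def unified_loops(A, B, C):
    # The same computation with all implicit loops made explicit --
    # one iteration space, with the elementwise add fused into the
    # innermost statement of the matmul nest.
    n, k = A.shape
    m = B.shape[1]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for l in range(k):
                acc += A[i, l] * B[l, j]
            C[i, j] = acc + 1.0
    return C
```

Once both forms live in one polyhedral representation, classic transformations (fusion, tiling, parallelization) apply uniformly across user loops and library operators.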

Our empirical evaluation shows significant performance improvements relative to sequential Python in both single-node and multi-node experiments, including a speedup of over 20,000× when using 24 nodes and 144 GPUs of the OLCF Summit supercomputer for the Space-Time Adaptive Processing (STAP) radar application.


Keywords:

  • Parallelizing compilers
  • Python language
  • Parallel computing
  • Heterogeneous computing
  • Distributed computing

  • DOI: 10.1007/978-3-031-12597-3_22
  • Chapter length: 17 pages




Acknowledgments. This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Agreement No. HR0011-20-9-0020. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. Also, this research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

Correspondence to Jun Shirako.


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Shirako, J., Hayashi, A., Paul, S.R., Tumanov, A., Sarkar, V. (2022). Automatic Parallelization of Python Programs for Distributed Heterogeneous Computing. In: Cano, J., Trinder, P. (eds) Euro-Par 2022: Parallel Processing. Euro-Par 2022. Lecture Notes in Computer Science, vol 13440. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-12596-6

  • Online ISBN: 978-3-031-12597-3

  • eBook Packages: Computer Science (R0)