Non-uniform Domain Decomposition for Heterogeneous Accelerated Processing Units

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11333)

Abstract

The use of heterogeneous architectures has become indispensable for optimizing application performance. Today, one of the most popular heterogeneous architectures is the discrete CPU+GPU. Despite the high computational power of such architectures, memory transfers between CPU and GPU are, in many cases, significant performance bottlenecks. To mitigate the cost of these transfers, chipmakers started to integrate CPU and GPU cores on the same chip, sharing the same main memory but with different memory address spaces, in architectures called APUs (Accelerated Processing Units). To exploit heterogeneous CPU+GPU architectures efficiently, the data must be split so that both processing units (PUs) can perform computations in parallel. Although this data split yields significant performance improvements, some applications can also be split functionally, as is the case of the Lattice-Boltzmann Method (LBM). In this work, we evaluate the performance of each kernel resulting from the functional decomposition of an OpenCL implementation of the Lattice-Boltzmann Method, using non-uniform domain decompositions between the CPU and GPU of an APU, to better understand how different decompositions affect each kernel. The experimental results on an AMD A10-7870K APU show that when all kernels on the same PU use the same domain decomposition, but the decomposition between CPU and GPU is non-uniform, each kernel is affected differently. These results suggest that using non-uniform domain decompositions not only between the PUs but also among the kernels on the same PU can further improve application performance.
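The abstract's central suggestion, choosing a CPU/GPU split per kernel instead of one split for the whole application, can be illustrated with a small sketch. It assumes (hypothetically; none of this is taken from the paper) that the lattice is partitioned by rows, that each kernel's throughput on each PU has been measured in rows per second, and that the goal is for CPU and GPU to finish at the same time. The kernel names `collision` and `streaming` and all rates are illustrative placeholders, not measurements.

```python
"""Sketch: non-uniform, per-kernel domain decomposition between CPU and GPU.

Assumptions (hypothetical, not from the paper): the lattice is split by rows,
each kernel has a known rows-per-second throughput on each processing unit,
and the split aims to balance CPU and GPU execution time.
"""
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    """Half-open row ranges [start, stop) assigned to each processing unit."""
    gpu: range
    cpu: range


def balanced_gpu_fraction(gpu_rate: float, cpu_rate: float) -> float:
    """Fraction of rows for the GPU so that both PUs finish at the same time.

    With n rows, the GPU needs f*n/gpu_rate seconds and the CPU needs
    (1-f)*n/cpu_rate; equating the two gives f = gpu_rate / (gpu_rate + cpu_rate).
    """
    if gpu_rate <= 0 or cpu_rate <= 0:
        raise ValueError("rates must be positive")
    return gpu_rate / (gpu_rate + cpu_rate)


def split_rows(n_rows: int, gpu_fraction: float) -> Split:
    """Give the first round(n_rows * gpu_fraction) rows to the GPU, the rest to the CPU."""
    if not 0.0 <= gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be in [0, 1]")
    gpu_rows = round(n_rows * gpu_fraction)
    return Split(gpu=range(0, gpu_rows), cpu=range(gpu_rows, n_rows))


def plan_per_kernel(n_rows: int, rates: dict[str, tuple[float, float]]) -> dict[str, Split]:
    """Compute an independent CPU/GPU split for each kernel.

    `rates` maps a kernel name to (gpu_rate, cpu_rate) in rows per second.
    """
    return {
        name: split_rows(n_rows, balanced_gpu_fraction(gpu, cpu))
        for name, (gpu, cpu) in rates.items()
    }


if __name__ == "__main__":
    # Hypothetical kernel names and throughputs, for illustration only.
    rates = {
        "collision": (3000.0, 1000.0),  # GPU 3x faster: GPU gets 75% of rows
        "streaming": (1000.0, 1000.0),  # equal speed: uniform 50/50 split
    }
    for name, s in plan_per_kernel(1024, rates).items():
        print(f"{name}: GPU rows {s.gpu.start}-{s.gpu.stop}, "
              f"CPU rows {s.cpu.start}-{s.cpu.stop}")
```

With these placeholder rates, the hypothetical `collision` kernel gets a 768/256 GPU/CPU split while `streaming` stays at 512/512, which is the kind of per-kernel non-uniformity the abstract proposes. The sketch ignores one cost a real implementation must pay: when consecutive kernels use different splits, some rows change owner between kernels, so their data must be made visible to the other PU, which on an APU with separate address spaces still requires synchronization or buffer mapping.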

Author information

Correspondence to Gabriel Freytag.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Freytag, G., Navaux, P.O.A., Lima, J.V.F., Schnorr, L.M., Rech, P. (2019). Non-uniform Domain Decomposition for Heterogeneous Accelerated Processing Units. In: Senger, H., et al. High Performance Computing for Computational Science – VECPAR 2018. VECPAR 2018. Lecture Notes in Computer Science, vol 11333. Springer, Cham. https://doi.org/10.1007/978-3-030-15996-2_8

  • DOI: https://doi.org/10.1007/978-3-030-15996-2_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-15995-5

  • Online ISBN: 978-3-030-15996-2

  • eBook Packages: Computer Science, Computer Science (R0)