Abstract
After reading this chapter, you will understand the fundamentals of high-performance computing and how to write efficient code for lattice Boltzmann method simulations. You will know how to optimise sequential codes and develop parallel codes for multi-core CPUs, computing clusters, and graphics processing units. The code listings in this chapter allow you to quickly get started with an efficient code and show you how to optimise your existing code.
Notes
- 1.
- 2. This is analogous to dividing decimal numbers by powers of 10 by “shifting” right.
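For illustration (this snippet is not one of the chapter's listings), dividing an unsigned integer by a power of two can be written as a right shift:

```cpp
#include <cstdio>

int main() {
    unsigned int n = 1280;
    // Dividing by 2^4 = 16 is the same as shifting the bits right by 4 places,
    // just as dividing a decimal number by 10^k shifts its digits right by k places.
    unsigned int by_division = n / 16;
    unsigned int by_shift    = n >> 4;
    std::printf("%u %u\n", by_division, by_shift); // prints "80 80"
    return 0;
}
```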
- 3. Pointers are variables that hold the address of another variable. See Appendix A.9.6 for more details.
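A minimal example (ours, not from the appendix) of declaring and dereferencing a pointer:

```cpp
#include <cstdio>

int main() {
    double rho = 1.0;
    double* p = &rho;          // p holds the address of rho
    *p = 0.5;                  // writing through p modifies rho
    std::printf("%f\n", rho);  // prints 0.500000
    return 0;
}
```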
- 4. For readers unfamiliar with assembly language or the instructions shown here for a typical modern 64-bit Intel processor: push and pop are instructions that save and retrieve their parameter from the “stack,” a special memory region where data can be stored temporarily. The instruction mov dst,src copies the contents of src to dst, where src and dst may be locations in memory or registers. QWORD PTR [addr] refers to the contents of the quadword (four words, which is eight bytes) at the location addr in memory. Numbers written as 0xhh represent the value hh in base 16 (hexadecimal). The symbols rax, rbp, and rsp denote 64-bit general-purpose registers, and xmm0, xmm1, and xmm2 are registers for floating-point values. Note that these are 128-bit floating-point registers that can store two double precision values or four single precision values, but in this code only the lower 64 bits are used. The instruction movsd dst,src means “move scalar double” and copies src to dst using only the lowest 64 bits if a register is specified. movapd dst,src moves the full 128-bit value from src to dst. The instructions addsd dst,src and mulsd dst,src are scalar addition and multiplication instructions, respectively, that store the result of adding/multiplying dst and src to dst. The function’s parameters are provided in the registers xmm0–xmm2, and its result is returned in xmm0. Execution continues in the calling function after the instruction ret.
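As a rough illustration (the function and the instruction sequence below are ours, not the chapter's listing, and real compiler output varies with compiler, flags, and calling convention), a scalar multiply-add in C++ and the kind of unoptimised x86-64 assembly it might compile to:

```cpp
#include <cstdio>

// A scalar multiply-add; the three double parameters arrive in xmm0, xmm1, xmm2,
// and the result is returned in xmm0.
double axpy(double a, double x, double y) {
    return a * x + y;
}
// Without optimisation, a compiler might emit assembly along these lines:
//   push  rbp                          ; save the caller's frame pointer on the stack
//   mov   rbp, rsp                     ; set up this function's own frame pointer
//   movsd QWORD PTR [rbp-0x8],  xmm0   ; store a on the stack
//   movsd QWORD PTR [rbp-0x10], xmm1   ; store x
//   movsd QWORD PTR [rbp-0x18], xmm2   ; store y
//   movsd xmm0, QWORD PTR [rbp-0x8]    ; reload a into xmm0
//   mulsd xmm0, QWORD PTR [rbp-0x10]   ; xmm0 = a * x
//   addsd xmm0, QWORD PTR [rbp-0x18]   ; xmm0 = a * x + y (the return value)
//   pop   rbp                          ; restore the caller's frame pointer
//   ret                                ; return to the calling function

int main() {
    std::printf("%f\n", axpy(2.0, 3.0, 1.0)); // prints 7.000000
    return 0;
}
```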
- 5. Only calculating these values is not enough; they must be used somehow, or the compiler will discard the unnecessary calculations.
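One common way to keep such calculations alive in a benchmark (a sketch with assumed names, not the chapter's timing code) is to accumulate the results and use the sum afterwards:

```cpp
#include <cstdio>
#include <cmath>

int main() {
    double sum = 0.0;
    // Accumulating into sum prevents the compiler from treating the loop as dead code.
    for (int i = 0; i < 100000000; ++i) {
        sum += std::sqrt(static_cast<double>(i));
    }
    // Using (here: printing) the accumulated value keeps the computation from being discarded.
    std::printf("checksum: %f\n", sum);
    return 0;
}
```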
- 6. On a command line, we can do this with output redirection. For example, ./sim > sim.out on a Unix command line (or sim.exe > sim.out in Windows) runs the program sim and saves its output to the text file sim.out. The output is not shown on the screen. To both display and save the output, we can use (on Unix systems) the tee command: ./sim | tee sim.out, where we have used a pipe, |, to send the output of one program to the input of another, in this case tee.
- 7. Where the term “node” is potentially ambiguous, we use the more specific terms “computing node” and “lattice node” for clarity.
- 8.
- 9. Depending on how the MPI implementation combines the output from the different processes, this synchronisation might not have the desired effect. Later output from rank 1, for example, might appear before any output from rank 0. If it is essential for the order of output to be synchronised, the data to be output from all ranks can be sent to one rank that displays it all in the correct order.
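A minimal sketch of this approach (the message format is ours, not one of the chapter's listings): every rank sends its line of output to rank 0, which prints the lines in ascending rank order:

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char message[64];
    std::snprintf(message, sizeof(message), "Hello from rank %d", rank);

    if (rank == 0) {
        std::printf("%s\n", message);
        // Receive and print the other ranks' messages in ascending rank order.
        for (int src = 1; src < size; ++src) {
            char buffer[64];
            MPI_Recv(buffer, sizeof(buffer), MPI_CHAR, src, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::printf("%s\n", buffer);
        }
    } else {
        MPI_Send(message, sizeof(message), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```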
- 10. MPI_Testall is a variant of MPI_Test. The variants of MPI_Test are analogous to those of MPI_Wait: MPI_Testany, MPI_Testsome, and MPI_Testall.
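A small sketch of how MPI_Testall can be used (the ring exchange and names are ours): non-blocking transfers are started and then polled so that other work can proceed until they have completed:

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Exchange one value with the neighbouring ranks (periodic in the rank index).
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;
    double send_value = rank, recv_value = -1.0;

    MPI_Request requests[2];
    MPI_Irecv(&recv_value, 1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &requests[0]);
    MPI_Isend(&send_value, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &requests[1]);

    // Poll with MPI_Testall; computation that does not need the transferred data
    // could be performed inside this loop while the communication completes.
    int done = 0;
    long polls = 0;
    while (!done) {
        MPI_Testall(2, requests, &done, MPI_STATUSES_IGNORE);
        ++polls;
    }

    std::printf("rank %d received %g after %ld polls\n", rank, recv_value, polls);
    MPI_Finalize();
    return 0;
}
```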
- 11. This matches the size of size_t on the systems used for testing, but it is not portable and may need to be changed for other systems.
- 12. mpirun, mpiexec, and orterun are synonyms in Open MPI.
- 13. The analysis of how a program uses memory and computing resources is called profiling. Automatic profiling software typically reports the time taken by the most time-consuming functions in a program and is useful for optimisation.
- 14. In textile weaving, a warp is a collection of parallel threads through which another thread, called the weft, is interlaced.
- 15. A macro is a compiler shortcut that allows programmers to conveniently use a fragment of code in many places. When preparing code for compilation, the compiler system replaces the name of the macro with the corresponding code fragment.
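A minimal example (the macro names NX and IDX are illustrative, not the chapter's): the preprocessor replaces each occurrence of IDX(x, y) with the bracketed expression before the compiler sees the code:

```cpp
#include <cstdio>

#define NX 64                        // number of lattice nodes in the x direction
#define IDX(x, y) ((y) * NX + (x))   // flatten 2D coordinates into a 1D array index

int main() {
    double rho[NX * NX] = {0.0};
    rho[IDX(3, 5)] = 1.0;            // expands to rho[(5) * 64 + (3)]
    std::printf("%f\n", rho[IDX(3, 5)]);
    return 0;
}
```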
- 16. This strict ordering of memory accesses is not necessary in general. The memory accesses within a warp are combined as long as they involve a contiguous block of memory, regardless of the details of which threads access which locations in memory.
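As a CUDA-style sketch (kernel and parameter names are assumed, not the chapter's listing): when consecutive threads access consecutive array elements, the 32 threads of a warp touch one contiguous block of memory, so their loads and stores are combined into few memory transactions:

```cpp
// Thread i handles element i, so each warp reads and writes 32 consecutive doubles.
__global__ void scale(const double* f_in, double* f_out, double factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        f_out[i] = factor * f_in[i];
    }
}
// A possible launch for n elements: scale<<<(n + 255) / 256, 256>>>(d_in, d_out, 2.0, n);
```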
Copyright information
© 2017 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Krüger, T., Kusumaatmaja, H., Kuzmin, A., Shardt, O., Silva, G., Viggen, E.M. (2017). Implementation of LB Simulations. In: The Lattice Boltzmann Method. Graduate Texts in Physics. Springer, Cham. https://doi.org/10.1007/978-3-319-44649-3_13
DOI: https://doi.org/10.1007/978-3-319-44649-3_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-44647-9
Online ISBN: 978-3-319-44649-3