Vectorized algorithm for multidimensional Monte Carlo integration on modern GPU, CPU and MIC architectures
Abstract
The aim of this paper is to show that multidimensional Monte Carlo integration can be efficiently implemented on computers with modern multicore CPUs and manycore accelerators, including Intel MIC and GPU architectures, using a new vectorized version of the LCG pseudorandom number generator which requires a limited amount of memory. We introduce two new implementations of the algorithm based on the directive-based parallel programming standards OpenMP and OpenACC and analyze their performance using the Hockney–Jesshope theoretical model of vector computations. We also present and discuss the results of experiments performed on dual-processor Intel Xeon E5-2670 computers with Intel Xeon Phi 7120P and NVIDIA K40m accelerators.
Keywords
Multidimensional Monte Carlo integration · Vectorized linear congruential generator · Performance analysis · GPU · Intel MIC · OpenMP · OpenACC
1 Introduction
In [28, 29] we have shown that multidimensional Monte Carlo integration can be efficiently implemented on various distributed memory parallel computers and clusters of multicore GPU-accelerated nodes using recently developed parallel versions of LCG and LFG pseudorandom number generators [26]. Unfortunately, those implementations use all available memory to generate a huge number of random points in parallel. In such a case, it is necessary to increase the number of nodes within a cluster when higher accuracy of results is desired.
The aim of this paper is to introduce a new simplified algorithm for multidimensional Monte Carlo integration based on a new vectorized version of the LCG pseudorandom number generator which requires a limited amount of memory. We analyze its performance using the Hockney–Jesshope model of vector computations [6, 4]. This model can also be applied to find the values of important parameters of the algorithm that minimize its execution time. Our new portable implementations are based on two directive-based parallel programming standards, OpenMP and OpenACC; thus, the algorithm can be run on computers with modern multicore CPUs and manycore accelerators, including Intel MIC and GPU architectures. We also present and discuss the results of experiments performed on dual-processor Intel Xeon E5-2670 computers with Intel Xeon Phi 7120P and NVIDIA K40m accelerators.
2 Multidimensional Monte Carlo integration
1. c is relatively prime to m,
2. every prime factor of m also divides \(a-1\),
3. if 4 divides m, then 4 divides \(a-1\).
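The three conditions above (the Hull–Dobell theorem for a full-period LCG) can be checked mechanically. The following helper is our illustrative sketch, not part of the authors' implementation; it verifies the conditions for given parameters a, c and m:

```c
#include <stdint.h>
#include <stdbool.h>

static uint64_t gcd(uint64_t x, uint64_t y)
{
    while (y) { uint64_t t = x % y; x = y; y = t; }
    return x;
}

/* Returns true iff x_{j+1} = (a*x_j + c) mod m has full period m
   according to the three Hull-Dobell conditions. */
bool full_period(uint64_t a, uint64_t c, uint64_t m)
{
    if (gcd(c, m) != 1)                            /* 1. c relatively prime to m */
        return false;
    uint64_t n = m;
    for (uint64_t p = 2; p * p <= n; ++p) {        /* 2. every prime factor of m */
        if (n % p == 0) {                          /*    must divide a-1         */
            if ((a - 1) % p != 0) return false;
            while (n % p == 0) n /= p;
        }
    }
    if (n > 1 && (a - 1) % n != 0) return false;   /* remaining large prime factor */
    if (m % 4 == 0 && (a - 1) % 4 != 0)            /* 3. if 4 | m then 4 | a-1 */
        return false;
    return true;
}
```

For example, the well-known parameters a = 1664525, c = 1013904223, m = 2^32 satisfy all three conditions and thus yield a full-period generator.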
Algorithm 2 shows how to apply (7) to perform multidimensional Monte Carlo integration. First, we find the vectors \(\mathbf {x}_0\), \(\mathbf {y}\), \(\mathbf {t}\) using three separate tasks that can be executed in parallel. Note that, because of data dependencies, the loops 2–4, 6–8 and 10–12 are strictly sequential. Then we convert all entries of \(\mathbf {x}_0\) to real values (line 13) using a fully vectorized operation and find the sum of \(f(\mathbf {v}_j)\) for \(j=1,\ldots , s-1\) (lines 15–17, parallel loop). In the second part of the algorithm, we find all vectors \(\mathbf {x}_i\), \(i=1,\ldots ,r-1\), convert their entries to real values using fully vectorized operations (lines 20 and 21, respectively) and find the sum of \(f(\mathbf {v}_j)\) (lines 22–24, parallel loop). It should be noticed that although the operations in lines 13, 20 and 21 are written in vector form, they can be treated as loops without data dependencies.
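The vector recurrence at the heart of this scheme can be sketched as follows. This is an illustrative reconstruction (the names and the choice m = 2^31 are ours, not the authors' exact code) of the standard leapfrog splitting: taking every s-th element of the scalar sequence \(x_{j+1} = (a x_j + c) \bmod m\) gives s streams that all obey the elementwise recurrence \(\mathbf{x}_i = a_s \mathbf{x}_{i-1} + c_s \pmod m\) with \(a_s = a^s \bmod m\) and \(c_s = c(a^{s-1}+\cdots+a+1) \bmod m\), so each step is a loop free of data dependencies:

```c
#include <stdint.h>
#include <stddef.h>

#define M_MASK 0x7FFFFFFFu   /* m = 2^31, so "mod m" is a cheap mask */

/* Compute the leapfrog constants a_s = a^s mod m and
   c_s = c*(a^{s-1}+...+a+1) mod m for stride s. */
void lcg_params(uint32_t a, uint32_t c, size_t s, uint32_t *a_s, uint32_t *c_s)
{
    uint64_t ap = 1, cp = 0;
    for (size_t k = 0; k < s; ++k) {
        cp = (cp * a + c) & M_MASK;
        ap = (ap * a) & M_MASK;
    }
    *a_s = (uint32_t)ap;
    *c_s = (uint32_t)cp;
}

/* One step of the vector recurrence: x <- a_s*x + c_s (mod m).
   The entries are independent, so this loop vectorizes. */
void lcg_step(uint32_t *x, size_t s, uint32_t a_s, uint32_t c_s)
{
    for (size_t j = 0; j < s; ++j)
        x[j] = (uint32_t)(((uint64_t)a_s * x[j] + c_s) & M_MASK);
}
```

Seeding x with the first s iterates of the scalar generator and applying lcg_step repeatedly reproduces the scalar sequence exactly, s numbers at a time.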
3 Performance analysis
It is clear that the performance of Algorithm 2 depends on the chosen values of r and s. Our experiments show that the right choice of these parameters can improve the performance significantly. Moreover, the right choice remains appropriate for various functions f (see Sect. 5). Thus, for the sake of simplicity, we use the theoretical model of vector computations introduced by Hockney and Jesshope [6, 4] to analyze Algorithm 2 reduced to the problem of finding \(r\cdot s\) vectors from \(I^d\). The model also helps us predict the choice of r and s that minimizes the execution time of the algorithm.
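For reference, the Hockney–Jesshope model characterizes the execution of a vector operation of length n by two machine parameters. In its standard form (our restatement of the model, from which timing formulas such as (10) are built), the time to process a vector loop of length n is

```latex
T(n) \;=\; \frac{n + n_{1/2}}{r_{\infty}},
```

where \(r_{\infty}\) is the asymptotic performance (the rate approached as \(n\rightarrow\infty\)) and \(n_{1/2}\) is the half-performance length, i.e., the vector length for which half of \(r_{\infty}\) is achieved. Short vectors are thus penalized by the startup term \(n_{1/2}\), which is why the split of the \(r\cdot s\) points between the sequential outer loop and the vectorized inner loops matters.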
It should be noticed that the theoretical model of performance based on (10) gives us some information about the possible scalability of vectorized algorithms. When loops are split over the available p processors or cores, the values of the parameters \(r_{\infty }\), \(n_{1/2}\) grow by a factor of p. However, due to synchronization, \(r_{\infty }\) grows more slowly and \(n_{1/2}\) faster [4]. Therefore, when the number of processors grows, our method can achieve better performance for bigger problem sizes (i.e., when the parallelized loops are sufficiently long).
4 Implementation issues
Both Algorithms 1 and 2 have been implemented in C. The implementation of Algorithm 1 is straightforward; however, a modern compiler can apply optimization techniques that improve the overall performance of the algorithm on modern CPUs. For example, the Intel C/C++ compiler applies the loop fission technique, in which each of the loops 3–6 and 11–14 is broken into two loops. The first one is strictly sequential and performs \(x_j\leftarrow ax_{j-1}+c\), while the second one performs \(v_j\leftarrow x_j/m\) and can be vectorized using new SIMD extensions like AVX and AVX-512, which are available in modern multicore and manycore processors [27, 9]. Vectorization can be enforced by placing the pragma simd before each loop [27]. To optimize memory access, the array \(\mathbf {v}\) should be allocated using the _mm_malloc() intrinsic, which works just like the malloc function and additionally allows data alignment [27]. This loop has limited length (i.e., d); thus, the use of multiple threads cannot be profitable.
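The effect of this loop fission can be sketched as follows. This is our illustrative version (using the portable OpenMP simd directive; the Intel-specific pragma simd mentioned above works analogously), with m = 2^31 assumed for the conversion:

```c
#include <stdint.h>
#include <stddef.h>

#define M_REAL 2147483648.0   /* m = 2^31 as a double for the conversion */

/* After fission, the recurrence stays in a strictly sequential loop,
   while the independent conversion v_j = x_j / m is vectorized. */
void fill_point(uint32_t *x, double *v, size_t d, uint32_t a, uint32_t c)
{
    for (size_t j = 1; j < d; ++j)      /* sequential: x_j depends on x_{j-1} */
        x[j] = (uint32_t)(((uint64_t)a * x[j-1] + c) & 0x7FFFFFFFu);

#pragma omp simd                        /* independent iterations: vectorizes */
    for (size_t j = 0; j < d; ++j)
        v[j] = (double)x[j] / M_REAL;
}
```

With the Intel compiler, x and v would additionally be allocated with _mm_malloc() so that the vector loop operates on aligned data.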
Our OpenACC implementation of Algorithm 2 assumes that the most compute-intensive part of the algorithm, namely lines 18–25, is offloaded to an accelerator (see Fig. 2). We use the OpenACC data construct to specify the scope of data in the accelerated region. The parallel loop construct is used to vectorize the internal loops 20, 21 and 22–24. Note that the variable result resides in host memory and is updated using the value of temp. We also use the OpenACC update host construct to guarantee that the actual value of the last entry of x resides in host memory.
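The overall shape of the offloaded region can be sketched as follows. This is a hedged, simplified one-dimensional reconstruction (names, loop bounds and the inlined integrand f(t) = t*t are ours, not the authors' exact code): the vector x of s generator streams stays on the device for the whole r-iteration outer loop, each inner loop advances all streams and accumulates f over the generated points with a reduction, and the final state of x is brought back with update host:

```c
#include <stdint.h>
#include <stddef.h>

#define M_MASK 0x7FFFFFFFu   /* m = 2^31 */
#define M_REAL 2147483648.0

/* Accumulate f over r*s pseudorandom points, advancing s leapfrog LCG
   streams (step constants a_s, c_s) on the accelerator. The integrand
   is inlined because indirect calls are not supported in OpenACC regions. */
double mc_sum_acc(uint32_t *x, size_t s, size_t r, uint32_t a_s, uint32_t c_s)
{
    double result = 0.0;
    #pragma acc data copy(x[0:s])
    {
        for (size_t i = 0; i < r; ++i) {
            double temp = 0.0;
            #pragma acc parallel loop reduction(+:temp)
            for (size_t j = 0; j < s; ++j) {
                x[j] = (uint32_t)(((uint64_t)a_s * x[j] + c_s) & M_MASK);
                double t = (double)x[j] / M_REAL;
                temp += t * t;        /* inlined integrand f(t) = t*t */
            }
            result += temp;           /* result lives in host memory */
        }
        #pragma acc update host(x[0:s])  /* final generator state back on host */
    }
    return result;
}
```

Without an OpenACC compiler the pragmas are ignored and the code runs sequentially with identical results, which makes the sketch easy to check on the host.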
It should be noticed that the PGI compiler, which supports OpenACC, has one disadvantage: it does not support indirect calls to external functions within accelerated regions. Instead, one should consider the use of functions inlined by the compiler. Unfortunately, in such a case the source code of the implementation has to be recompiled for each integrand function.
5 Results of experiments
E5-2670:
a server with two Intel Xeon E5-2670 v3 processors (24 cores in total, with hyper-threading, 2.3 GHz), 128 GB RAM, running under CentOS 6.5 with Intel Parallel Studio ver. 2016,
Xeon Phi:
like E5-2670, additionally with an Intel Xeon Phi 7120P coprocessor (61 cores with multithreading, 1.238 GHz, 16 GB RAM); all experiments have been carried out on the Xeon Phi working in native mode [8, 21],
K40m:
like E5-2670, additionally with an NVIDIA Tesla K40m GPU [13] (2880 cores, 12 GB RAM), CUDA 7.0 and Portland Group compilers and tools with OpenMP and OpenACC support.
Estimated \(\beta \) and exemplary values of \(s^\star \) for K40m, Xeon Phi and two E5-2670

n        d     K40m (\(\beta =7744\))   Xeon Phi (\(\beta =1024\))   E5-2670 (\(\beta =2822\))
1e+06    4          44000                    16000                       26561
         16         22000                     8000                       13281
         64         11000                     4000                        6640
1e+07    4         139140                    50596                       83994
         16         69570                    25298                       41997
         64         34785                    12649                       20999
First, we have tested the performance of Algorithms 1 and 2 for the test functions proposed in [5] using various values of r and s. We have observed that the right choice of these parameters can significantly improve the performance of Algorithm 2. Moreover, the optimal values of these parameters minimize the execution time for various test functions.
Figure 3 shows the performance of Algorithm 2 reduced to the problem of finding \(r\cdot s\) vectors from \(I^d\). We can observe that the optimal value of s depends on the problem size, namely n and d. It is also different for each architecture. After these experiments, we have evaluated the approximation of the parameter \(\beta \) using (16). Then, applying (14) and (15), we have obtained theoretical approximations of the optimal values of s for various n and d (see Table 1).
Execution time of Algorithm 1 (only on E5-2670) and Algorithm 2 for Continuous and NAG test functions

              Continuous                               NAG test
n        Alg.1 (E5)   E5      Phi     K40m       Alg.1 (E5)   E5      Phi     K40m
1e+05    0.002        0.002   0.023   0.004      0.001        0.002   0.026   0.004
1e+06    0.017        0.005   0.030   0.008      0.012        0.008   0.035   0.012
1e+07    0.162        0.025   0.078   0.032      0.118        0.035   0.097   0.044
1e+08    1.623        0.190   0.268   0.138      1.157        0.297   0.355   0.209
1e+09    16.921       2.483   2.031   0.765      11.060       6.136   3.798   1.601
Execution time of Algorithm 1 (only on E5-2670) and Algorithm 2 for Corner peak and Product peak functions

                   Corner peak                              Product peak
d    n        Alg.1 (E5)   E5      Phi     K40m       Alg.1 (E5)   E5      Phi     K40m
4    1e+05    0.008        0.013   0.034   0.006      0.005        0.003   0.024   0.006
     1e+06    0.075        0.024   0.040   0.023      0.047        0.007   0.033   0.023
     1e+07    0.746        0.076   0.111   0.102      0.475        0.036   0.071   0.125
     1e+08    7.459        0.634   0.479   0.580      5.086        0.418   0.244   0.889
     1e+09    74.586       6.417   4.752   4.966      46.596       6.454   2.929   8.203
16   1e+05    0.015        0.025   0.032   0.008      0.013        0.014   0.027   0.009
     1e+06    0.145        0.060   0.050   0.030      0.125        0.030   0.034   0.039
     1e+07    1.448        0.163   0.164   0.163      1.245        0.141   0.105   0.246
     1e+08    14.483       1.840   1.163   1.135      12.472       1.380   0.759   1.895
     1e+09    144.811      13.626  10.260  9.823      124.566      12.860  8.037   17.239
64   1e+05    0.043        0.034   0.042   0.016      0.041        0.034   0.036   0.019
     1e+06    0.428        0.084   0.089   0.075      0.404        0.089   0.056   0.092
     1e+07    4.274        0.479   0.345   0.597      4.036        0.439   0.230   0.729
     1e+08    42.729       5.340   3.529   4.452      40.356       5.395   3.019   6.620
     1e+09    427.951      50.657  30.761  38.499     404.294      70.072  27.361  65.353
We can observe that Algorithm 2 outperforms Algorithm 1 significantly on all architectures and for all test functions. However, the use of Algorithm 2 is much more profitable for bigger problem sizes and more complicated functions. In such cases, the manycore architectures, namely Xeon Phi and K40m, outperform E5-2670. For the Corner peak and Product peak functions, Xeon Phi outperforms K40m for \(n>1e+06\), and the advantage is greater for \(d\ge 16\). K40m works fine for smaller values of d, when coalesced memory access can take place, i.e., when multiple memory accesses are combined into a single transaction [17, 16]. Similarly, GPUs outperform Xeon Phi when the integrand functions contain calls to transcendental functions; in such a case plenty of cores can be utilized. Table 3 shows that the form of the integrand function has a great influence on the performance of the algorithm on individual architectures. In the case of Xeon Phi, the performance of Algorithm 2 for Corner peak and Product peak is nearly the same, while K40m requires almost twice as much time for Product peak as for Corner peak. Computations on GPUs are much more effective when the work performed by the cores is rather simple, but the form of Product peak is more complicated and requires more memory references. When the integrand function is really simple, K40m achieves much better performance than Xeon Phi and E5-2670 (see Fig. 4). The results presented in Figs. 4 and 5 also confirm the theoretical considerations regarding scalability (see Sect. 3). Indeed, Xeon Phi and K40m (i.e., manycore architectures) achieve better speedup for bigger problem sizes and when the integrand function is rather simple. Too many memory references, which may appear in more complex integrand functions, can limit the speedup of the method.
It should be pointed out that the performance of Algorithm 2 on GPUs could be improved by using the CUDA [17] or OpenCL [14] programming interfaces. Unfortunately, this would require much more effort than using OpenACC. On the other hand, the implementation for Intel architectures can be optimized using more sophisticated techniques like programming with intrinsics for Intel Advanced Vector Extensions [9].
6 Conclusions
We have shown that multidimensional Monte Carlo integration based on a new vectorized version of the linear congruential generator can be easily and efficiently implemented on modern CPU, GPU and Intel MIC architectures, including Intel Xeon E5-2670, Xeon Phi 7120P and NVIDIA K40m, using high-level directive-based standards like OpenMP and OpenACC. The new version of LCG requires a limited amount of memory; thus, the number of generated pseudorandom points can be really huge. We have also shown how to use the Hockney–Jesshope theoretical model of vector computations to find the values of the algorithm's parameters that minimize its execution time.
Acknowledgements
This work was partially supported by the National Centre for Research and Development under MICLAB Project POIG.02.03.00-24-093/13. The use of computer resources installed at the Institute of Mathematics, Maria Curie-Skłodowska University, Lublin, is kindly acknowledged.
References
1. Bull JM, Freeman TL (1994) Parallel globally adaptive quadrature on the KSR-1. Adv Comput Math 2:357–373. https://doi.org/10.1007/BF02521604
2. Chandra R, Dagum L, Kohr D, Maydan D, McDonald J, Menon R (2001) Parallel programming in OpenMP. Morgan Kaufmann Publishers, San Francisco
3. Chen C, Huang K, Lyuu Y (2015) Accelerating the least-square Monte Carlo method with parallel computing. J Supercomput 71(9):3593–3608. https://doi.org/10.1007/s11227-015-1451-7
4. Dongarra J, Duff I, Sorensen D, van der Vorst H (1991) Solving linear systems on vector and shared memory computers. SIAM, Philadelphia
5. Hahn T (2005) CUBA: a library for multidimensional integration. Comput Phys Commun 168:78–95
6. Hockney R, Jesshope C (1981) Parallel computers: architecture, programming and algorithms. Adam Hilger Ltd., Bristol
7. Hockney RW (1985) \((r_{{\infty }}, n_{{1/2}}, s_{{1/2}})\) measurements on the 2-CPU CRAY X-MP. Parallel Comput 2(1):1–14. https://doi.org/10.1016/0167-8191(85)90014-6
8. Jeffers J, Reinders J (2013) Intel Xeon Phi coprocessor high-performance programming. Morgan Kaufmann, Waltham
9. Jeffers J, Reinders J, Sodani A (2016) Intel Xeon Phi processor high-performance programming: Knights Landing edition. Morgan Kaufmann, Cambridge
10. Knuth DE (1981) The art of computer programming, volume II: seminumerical algorithms, 2nd edn. Addison-Wesley, Boston
11. Knuth DE (1999) MMIXware, a RISC computer for the third millennium. Lecture Notes in Computer Science, vol 1750. Springer, Berlin
12. Lautrup B (1971) An adaptive multidimensional integration procedure. In: Proceedings of the 2nd Colloquium on Advanced Methods in Theoretical Physics, Marseille
13. Li Y, Schwiebert L, Hailat E, Mick JR, Potoff JJ (2016) Improving performance of GPU code using novel features of the NVIDIA Kepler architecture. Concurr Comput Pract Exp 28(13):3586–3605. https://doi.org/10.1002/cpe.3744
14. Munshi A (2009) The OpenCL specification v. 1.0. Khronos OpenCL Working Group
15. Niederreiter H (1978) Quasi-Monte Carlo methods and pseudo-random numbers. Bull Am Math Soc 84:957–1041
16. NVIDIA Corporation (2015) CUDA C best practices guide. Available at http://www.nvidia.com/
17. NVIDIA Corporation (2015) CUDA programming guide. Available at http://www.nvidia.com/
18. OpenACC-Standard.org (2015) The OpenACC application programming interface, v2.5. Tech. rep. http://www.openacc.org/sites/default/files/OpenACC_2pt5.pdf
19. Press WH, Teukolsky SA, Vetterling WT, Flannery BP (1992) Numerical recipes in C, 2nd edn. Cambridge University Press, Cambridge
20. Pryor DV, Burns PJ (1989) Vectorized Monte Carlo molecular aerodynamics simulation of the Rayleigh problem. J Supercomput 3(4):305–330. https://doi.org/10.1007/BF00128168
21. Rahman R (2013) Intel Xeon Phi coprocessor architecture and tools: the guide for application developers. Apress, Berkeley
22. Ripoll DR, Thomas SJ (1992) A parallel Monte Carlo search algorithm for the conformational analysis of polypeptides. J Supercomput 6(2):163–185. https://doi.org/10.1007/BF00129777
23. Sabne A, Sakdhnagool P, Lee S, Vetter JS (2014) Evaluating performance portability of OpenACC. In: Languages and Compilers for Parallel Computing, 27th International Workshop, LCPC 2014, Hillsboro, OR, USA, September 15–17, 2014, Revised Selected Papers, pp 51–66. https://doi.org/10.1007/978-3-319-17473-0_4
24. Santos EE, Rickman JM, Muthukrishnan G, Feng S (2008) Efficient algorithms for parallelizing Monte Carlo simulations for 2D Ising spin models. J Supercomput 44(3):274–290. https://doi.org/10.1007/s11227-007-0163-z
25. Stpiczyński P (2011) Solving linear recurrence systems on hybrid GPU accelerated manycore systems. In: Proceedings of the Federated Conference on Computer Science and Information Systems, September 18–21, 2011, Szczecin, Poland. IEEE Computer Society Press, pp 465–470. https://fedcsis.org/proceedings/2011/pliks/148.pdf
26. Stpiczyński P, Szałkowski D, Potiopa J (2012) Parallel GPU-accelerated recursion-based generators of pseudorandom numbers. In: Proceedings of the Federated Conference on Computer Science and Information Systems, September 9–12, 2012, Wroclaw, Poland. IEEE Computer Society Press, pp 571–578. http://fedcsis.org/proceedings/2012/pliks/380.pdf
27. Supalov A, Semin A, Klemm M, Dahnken C (2014) Optimizing HPC applications with Intel cluster tools. Apress, Berkeley
28. Szałkowski D, Stpiczyński P (2014) Multidimensional Monte Carlo integration on clusters with hybrid GPU-accelerated nodes. In: Parallel Processing and Applied Mathematics, 10th International Conference, PPAM 2013, Warsaw, Poland, September 8–11, 2013, Revised Selected Papers, Part I. Lecture Notes in Computer Science, vol 8384. Springer, pp 603–612. https://doi.org/10.1007/978-3-642-55224-3_56
29. Szałkowski D, Stpiczyński P (2015) Using distributed memory parallel computers and GPU clusters for multidimensional Monte Carlo integration. Concurr Comput Pract Exp 27(4):923–936. https://doi.org/10.1002/cpe.3365
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.