In this section, we exemplify the key features and assess the parallel performance of our framework by performing UQ studies of representative applications. In particular, we compare the time to solution (TTS) as well as the computational cost and the PDF estimation efficiency for two engineering applications that require significant computational resources. Both applications exhibit a significant TTS for a single posterior evaluation and target multi-core and hybrid CPU/GPU clusters. Furthermore, they demonstrate the coupling of third-party parallel scientific software with our framework.
4.1 TMCMC and CMA-ES on a GPU Cluster
We perform UQ+P for the most widely used MD model, that of water. We use the 5-site water model TIP5P-E. The calibration data consist of the oxygen-oxygen radial distribution function in bulk water together with its experimental uncertainty. Each evaluation of a posterior sample requires two full MD simulation runs with the MD code GROMACS 5.0, compiled with hybrid CPU-GPU acceleration. The final posterior value is computed in a post-processing stage that invokes a Matlab script operating on the output of the simulation runs. The prediction error matrix \({\varSigma }\) is decomposed into three contributions with elements \({\varSigma }_{ii}=\sigma _{exp}^2+\sigma _{ens}^2+\sigma _m^2\). We estimate \(\sigma _{ens}^2\approx 0.005\), the experimental uncertainty contributions \(\sigma _{exp}^2\) are known and, finally, the additional model prediction error term \(\sigma _m^2\) is left to be determined by the inference process [16]. The parameters \(\left( \mathrm {\epsilon _{O-O}^{LJ}},\mathrm {\sigma _{O-O}^{LJ}}\right) \) and \(q_{\mathrm {O}}\) are the Lennard-Jones interaction parameters and the charge interaction parameter, respectively. We use truncated Gaussian priors for the three parameters, with mean values \(\bar{\theta }_{\pi }\) based on the literature values for TIP5P [17] and a standard deviation of \(30\,\%\) of \(\bar{\theta }_{\pi }\), whereas the hyperparameter follows a Gamma prior, that is \(\sigma ^2_m\sim \Gamma (1.5,0.25)\).
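For reference, assuming independent Gaussian prediction errors at the \(N_d\) bins of the radial distribution function (a sketch of the usual formulation; the symbols \(d_i\) for the data and \(g_i(\theta )\) for the simulated oxygen-oxygen RDF values are introduced here only for illustration), the log-likelihood entering the posterior evaluation reads
\[
\ln p\left( d \,|\, \theta , \sigma _m^2\right) \;=\; -\frac{1}{2}\sum _{i=1}^{N_d}\left[ \frac{\left( d_i - g_i(\theta )\right) ^2}{{\varSigma }_{ii}} + \ln \left( 2\pi {\varSigma }_{ii}\right) \right] ,
\qquad {\varSigma }_{ii}=\sigma _{exp}^2+\sigma _{ens}^2+\sigma _m^2 .
\]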
Results. We present the timings and the results of the calibration of the TIP5P-E water model. We performed our simulations on 32 compute nodes of the Piz Daint Cray XC30 cluster at the Swiss National Supercomputing Centre (CSCS). Each node is equipped with an 8-core Intel Xeon E5-2670 processor and one NVIDIA Tesla K20X GPU. TORC is initialized with a single worker per node because each posterior evaluation task fully utilizes a compute node by means of the hybrid CPU/GPU configuration of GROMACS. Posterior evaluations are invoked through a separate proxy server process that receives a set of parameters, invokes the GROMACS model executions and the Matlab-based post-processing phase, and finally sends back the posterior value. This approach, depicted in Fig. 1, minimizes runtime overheads because the Matlab environment is initialized only once; furthermore, it offers high flexibility and portability.
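The worker side of this proxy scheme can be sketched as follows. This is a minimal illustration only: the port number, the message layout and the function name evaluate_posterior are assumptions made for the example and not the actual interface of the framework.
\begin{verbatim}
/* Worker-side stub: forwards a parameter vector to the persistent,
 * node-local proxy process (which runs the GROMACS simulations and the
 * Matlab post-processing) and reads back the log-posterior value. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define PROXY_PORT 31000            /* hypothetical port of the proxy */

int evaluate_posterior(const double *theta, int ndim, double *logpost)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons(PROXY_PORT);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);  /* same node */

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }

    /* send the dimension followed by the parameter vector */
    write(fd, &ndim, sizeof(ndim));
    write(fd, theta, ndim * sizeof(double));

    /* the proxy launches the two GROMACS runs, calls the Matlab
     * post-processing once the trajectories are ready and replies
     * with a single double: the log-posterior value */
    ssize_t r = read(fd, logpost, sizeof(double));
    close(fd);
    return (r == (ssize_t)sizeof(double)) ? 0 : -1;
}
\end{verbatim}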
Each posterior evaluation requires between 17 and 21 min of wall-clock time on the above-mentioned computing architecture. The variation of the mean time for completing a posterior evaluation is due to the different runtimes obtained for different parameter values. The variance in the evaluation time and the maximum chain length are the main sources of load imbalance in this application. We address the first issue by using 256 samples per generation, i.e. 8x the number of workers, while we alleviate the second problem by sorting the chains according to their length and then evenly distributing the total workload among the available workers, as sketched below. The maximum chain length determines the lowest possible processing time for each generation and hence the maximum number of workers beyond which the execution time does not improve and the parallel efficiency is negatively affected.
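One way to realize this even distribution is a longest-chain-first greedy assignment, sketched below. The data structures are illustrative, and in the framework the static assignment is complemented by the task stealing of TORC.
\begin{verbatim}
/* Illustrative longest-chain-first distribution of TMCMC chains:
 * sort the chains by decreasing length and repeatedly hand the next
 * chain to the worker with the smallest accumulated workload. */
#include <stdlib.h>

typedef struct { int id; int len; } chain_t;

static int by_len_desc(const void *a, const void *b)
{
    return ((const chain_t *)b)->len - ((const chain_t *)a)->len;
}

/* owner[c.id] receives the index of the worker assigned to chain c */
void assign_chains(chain_t *chains, int nchains, int nworkers, int *owner)
{
    int *load = calloc(nworkers, sizeof(int));  /* accumulated steps */

    qsort(chains, nchains, sizeof(chain_t), by_len_desc);

    for (int i = 0; i < nchains; i++) {
        int best = 0;                           /* least-loaded worker */
        for (int w = 1; w < nworkers; w++)
            if (load[w] < load[best]) best = w;
        owner[chains[i].id] = best;
        load[best] += chains[i].len;
    }
    free(load);
}
\end{verbatim}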
Figure 2 (top, left) depicts the efficiency of TMCMC, while Fig. 2 (top, right) depicts how the time of a single posterior evaluation varies over a total of 15 generations. The above solutions, along with the stealing mechanism of TORC, minimize the idle time of the workers and result in a parallel efficiency higher than 97 % when every worker executes the same number of posterior evaluations. The lower efficiency (\(\approx \)88.4 %) for the 12th and 14th generations of TMCMC is attributed to the fact that the maximum chain length was equal to 9 in both cases, which imposes an upper limit of about 88 % on the expected efficiency: each of the 32 workers ideally performs \(256/32=8\) posterior evaluations per generation, whereas a chain of length 9 must be processed serially by a single worker, bounding the efficiency by \(8/9\). Similar behavior is observed in Fig. 2 (bottom) for the parallel CMA-ES, where the parallel efficiency and the statistics of the evaluation time are reported every 10 generations. We notice that the measured parallel efficiency is equal to 90.1 % at the end of the 10th generation, which is due to the lower number of samples (64) per generation and the high variance of the evaluation time. This variance decreases as the algorithm evolves and the efficiency increases accordingly, up to 97.4 %.
The computational cost of the MD calibration with the two methods is presented in Table 1. The mean parameter estimates as well as their associated uncertainties are summarized in Table 2. The coefficient of variation \(u_{\theta }\) of a parameter \(\theta \) is defined as the sample standard deviation of that parameter over its estimated mean \(\bar{\theta }\).
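In the notation of Table 2, this is
\[
u_{\theta } = \frac{\sigma _{\theta }}{\bar{\theta }} ,
\]
where \(\sigma _{\theta }\) denotes the posterior sample standard deviation of \(\theta \).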
Table 1. Computational effort of the MD calibration.
Table 2. Mean values and coefficients of variation of the posterior distribution of the model parameters, along with the log-evidence values of each model class.
4.2 ABC-SubSim on a Multicore Cluster
As a stochastic model we consider the calibration of the Lennard-Jones potential parameters for helium. To perform the calibration we used data on the Boltzmann factor \(f_B = \left\langle \exp \left( -\frac{H}{k_B T }\right) \right\rangle \), where H is the enthalpy of the system of helium atoms, T is the temperature of the system, \(k_B\) is the Boltzmann constant and \(\langle \cdot \rangle \) denotes the ensemble average. The data were generated using the software LAMMPS for a system of 1000 atoms simulated for 20 ns in the NPT ensemble with a timestep of 2 fs. The system used for the calibration also consists of 1000 atoms and is equilibrated for 2 ns, followed by a production run in the NPT ensemble for another 2 ns with a 2 fs timestep. We performed the calibration with two different settings. In the first setting we assume that the resulting Boltzmann factor distribution is Gaussian and use the discrepancy function \(\rho (x,y) = \sqrt{\left( (\mu _x-\mu _y)/\mu _x\right) ^2 + \left( (\sigma _x-\sigma _y)/\sigma _x\right) ^2}\). In the second setting the discrepancy is given by \(\rho (x,y) = D_{KL}(P||Q) \), where \(D_{KL}\) is the Kullback-Leibler divergence, P is the data distribution and Q is the distribution of the Boltzmann factor obtained from the simulation.
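Both discrepancy measures are simple to evaluate once the Boltzmann-factor samples have been reduced to summary statistics or to normalized histograms. A minimal sketch is given below; the histogram-based KL estimate and the assumption that Q has been regularized (e.g. with a small pseudo-count) are illustrative choices rather than the actual implementation.
\begin{verbatim}
/* Illustrative implementations of the two discrepancy measures:
 * rho_gauss compares the sample mean/std of the data (x) and of the
 * simulation (y); kl_divergence is a discrete KL estimate between the
 * normalized histograms P (data) and Q (simulation). */
#include <math.h>

/* rho(x,y) = sqrt(((mu_x-mu_y)/mu_x)^2 + ((sigma_x-sigma_y)/sigma_x)^2) */
double rho_gauss(double mu_x, double sigma_x, double mu_y, double sigma_y)
{
    double dm = (mu_x - mu_y) / mu_x;
    double ds = (sigma_x - sigma_y) / sigma_x;
    return sqrt(dm * dm + ds * ds);
}

/* D_KL(P||Q) = sum_i P_i log(P_i/Q_i); bins with P_i = 0 contribute
 * nothing and Q is assumed to be strictly positive in every bin. */
double kl_divergence(const double *P, const double *Q, int nbins)
{
    double d = 0.0;
    for (int i = 0; i < nbins; i++)
        if (P[i] > 0.0)
            d += P[i] * log(P[i] / Q[i]);
    return d;
}
\end{verbatim}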
Results. The algorithm runs a full molecular dynamics simulation for every parameter set and hence requires a significant amount of computational work. It also exhibits two levels of parallelism: the Markov chains with different seeds can be processed in parallel, while each individual simulation can also run in parallel using the MPI version of LAMMPS, as sketched below.
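A single function evaluation can then be viewed as a thin wrapper around a parallel LAMMPS run. In the sketch below the launcher command, the input script name in.helium, the variable names and the output file handling are illustrative assumptions rather than the actual setup.
\begin{verbatim}
/* Illustrative wrapper for one ABC-SubSim function evaluation: the
 * sample launches its own parallel LAMMPS run and the resulting
 * Boltzmann-factor samples are written to a file for the subsequent
 * discrepancy computation. */
#include <stdio.h>
#include <stdlib.h>

#define LAMMPS_RANKS 4   /* cores per simulation, as in the runs below */

int run_lammps_evaluation(double epsilon, double sigma, const char *outfile)
{
    char cmd[512];
    snprintf(cmd, sizeof(cmd),
             "mpirun -np %d lmp -in in.helium "
             "-var epsilon %.8g -var sigma %.8g -var outfile %s",
             LAMMPS_RANKS, epsilon, sigma, outfile);
    /* blocks until the equilibration and NPT production runs finish;
     * the input script dumps the sampled Boltzmann factors to outfile */
    return system(cmd);
}
\end{verbatim}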
The time to solution of each function evaluation varies with the given parameters, introducing load imbalance into the algorithm. We deal with this issue by submitting the tasks with the higher expected execution times first: we sort the samples according to the value of the \(\sigma \) parameter before distributing the corresponding function evaluation or Markov chain tasks to the workers. Moreover, we combine this scheme with the task stealing of TORC.
Table 3. Detailed per-level performance results of ABC-SubSim on 512 nodes of Piz Daint. \(T_{f}\) shows the mean and standard deviation of the simulation times and \(T_{w}\) the wall-clock time per generation. All times are reported in seconds.
Table 4. Prior and posterior information for the parameters of the helium system in molecular LAMMPS units: the number of generations \(N_{gen}\) computed before the acceptance rate reached the threshold value of 5 %, the achieved tolerance levels \(\delta \) for the two models \(M_{G}\) (Gaussian setting) and \(M_{KL}\) (Kullback-Leibler setting), and the prior bounds \([\theta _l, \theta _r]\), mean values \(\bar{\theta }\) and coefficients of variation \(u_{\theta }\) of the Lennard-Jones parameters of helium.
We performed our simulations on 512 compute nodes of the Piz Daint cluster (4096 cores in total). TORC is initialized with two MPI workers per node and each LAMMPS simulation in turn utilizes 4 cores. The population size was set to 15360 samples and the Markov chain length was equal to 5. The algorithm stops when the acceptance rate drops below 5 %.
Table 3 summarizes the parallel performance of ABC-SubSim. Despite the high variance of the time for a single simulation run, the efficiency of the initialization phase (level 0) reaches 82 %, as 15360 function evaluations are distributed among the 1024 workers. The lower efficiency (70.5 %) of level 1 is attributed to the existence of chains with high accumulated running times and to the small number of chains available per worker (3072 chains in total, i.e. 3 chains per worker). As the algorithm evolves, the efficiency increases and reaches 92 % for the last level, which exhibits a load imbalance of approximately 8 % as computed by \((T_{max} - T_{avg})/T_{avg}\), where \(T_{max}\) and \(T_{avg}\) are the maximum and average time that the workers were busy during the processing of the specific level. The information about the prior and the posterior values of the parameters is given in Table 4.