In this section we discuss the performance of Sarus by comparing data for a variety of workloads, executed both natively and in containers. The container images represent a best-effort reproduction of the native applications in terms of software releases, compiler versions, compilation toolchains, compilation options, and libraries. Despite this effort, exact reproducibility of the application binaries cannot be guaranteed. However, this also highlights how the Docker workflow of packaging applications with mainstream Linux distributions (e.g. Ubuntu) and basic system tools (e.g. package managers) can integrate seamlessly with HPC environments: container images targeted at personal workstations can still achieve performance consistently comparable to native installations.
Unless otherwise noted, each data point reports the average and standard deviation of 50 runs, to produce statistically meaningful results. For a given application, all repetitions at each node count, for both native and container execution, were performed on the same allocated set of nodes.
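As a simple illustration of this aggregation, the per-data-point statistics can be computed as in the sketch below; the measurement values and node counts are hypothetical placeholders, not data from our experiments.

```python
# Minimal sketch of the per-data-point aggregation described above.
# All values below are hypothetical placeholders, not measured data.
import statistics

# Map node count -> list of repeated performance measurements (e.g. 50 runs each).
samples = {
    4: [1.92, 1.95, 1.90, 1.93],
    8: [3.71, 3.69, 3.74, 3.70],
}

for nodes, runs in samples.items():
    avg = statistics.mean(runs)
    std = statistics.stdev(runs)  # sample standard deviation
    print(f"{nodes} nodes: mean = {avg:.2f}, stdev = {std:.2f}")
```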
We conduct our tests on Piz Daint, a hybrid Cray XC50/XC40 system in production at the Swiss National Supercomputing Centre (CSCS) in Lugano, Switzerland. The compute nodes are connected by the Cray Aries interconnect in a Dragonfly topology, providing users access to hybrid CPU-GPU nodes. Hybrid nodes are equipped with an Intel® Xeon® E5-2690 v3 processor, 64 GB of RAM, and a single NVIDIA® Tesla® P100 GPU with 16 GB of memory. The software environment on Piz Daint is the Cray Linux Environment 6.0.UP07 (CLE 6.0) [7], which uses Environment Modules [9] to provide access to compilers, tools, and applications. The default versions of the NVIDIA CUDA and MPI software stacks are CUDA 9.1 and Cray MPT 7.7.2, respectively.
We install and configure Sarus on Piz Daint to use runc as the OCI-compliant runtime, together with the native MPI (H1) and NVIDIA Container Runtime (H2) hooks introduced in Sect. 4, and to mount container images from a Lustre parallel filesystem [31].
5.1 Scientific Applications
In this section, we test three popular scientific application frameworks, widely used in both research and industry. The objective is to demonstrate the capability of Sarus and its HPC extensions (see Sect. 4) to run real-world production workloads using containers, while performing on par with highly-tuned native deployments. All container runs in the following subsections use both the MPI and NVIDIA GPU hooks.
GROMACS. GROMACS [2] is a molecular dynamics package with an extensive array of modeling, simulation and analysis capabilities. While primarily developed for the simulation of biochemical molecules, its broad adoption includes research fields such as non-biological chemistry, metadynamics and mesoscale physics.
For the experiment we select GROMACS Test Case B from PRACE’s Unified European Applications Benchmark Suite [32]. The test case consists of a model of cellulose and lignocellulosic biomass in an aqueous solution. This inhomogeneous system of 3.3 million atoms uses reaction-field electrostatics instead of smooth particle-mesh Ewald (SPME) [10], and therefore should scale well. We run the simulation in single precision, with 1 MPI process per node and 12 OpenMP threads per MPI process. We perform runs from a minimum of 4 nodes up to 256 nodes, increasing the node count in powers of two and carrying out 40 repetitions for each data point.
As the native application we use GROMACS release 2018.3, built by CSCS staff and available on Piz Daint through an environment module. For the container application, we build GROMACS 2018.3 with GPU acceleration inside an Ubuntu-based container image.
The results are illustrated in Fig. 2. We measure performance in ns/day as reported by the application logs. Speedup values are computed from the average performance of each data point, taking the native value at 4 nodes as the baseline.
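For reference, the speedup computation follows the sketch below; the averages are hypothetical placeholders, with the native average at 4 nodes used as the baseline.

```python
# Minimal sketch of the speedup computation described above.
# Hypothetical per-node-count performance averages in ns/day, not measured data.
native_avg    = {4: 1.9, 8: 3.8, 16: 7.5, 32: 14.6}
container_avg = {4: 1.9, 8: 3.8, 16: 7.4, 32: 14.5}

baseline = native_avg[4]  # native performance at 4 nodes is the baseline

speedup_native    = {n: avg / baseline for n, avg in native_avg.items()}
speedup_container = {n: avg / baseline for n, avg in container_avg.items()}

print(speedup_native)
print(speedup_container)
```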
We observe the container application consistently matching the scalability profile of the native version. Absolute performance is identical up to 32 nodes. From 64 nodes upwards, the differences (up to \(2.7\%\) at 256 nodes) are consistent with the run-to-run variability we typically observe when sharing the system with other users. Standard deviation values remain comparable across all node counts.
TensorFlow with Horovod. TensorFlow [1] is a popular open source software framework for the development of machine-learning (ML) models, with a focus on deep neural networks.
Horovod [36] is a framework for distributed training of deep neural networks on top of another ML framework, such as TensorFlow, Keras, or PyTorch. Notably, it allows replacing TensorFlow’s own parameter-server architecture for distributed training with communication based on an MPI model, providing improved usability and performance.
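To illustrate this programming model, the sketch below adapts a TensorFlow 1.x training loop to Horovod; it is a minimal example of the MPI-based data-parallel approach, not the code of the benchmark we run, and the toy model, batch size, and hyperparameters are placeholders.

```python
# Minimal sketch of MPI-style data-parallel training with Horovod and TensorFlow 1.x.
# The toy model, batch size and step count are illustrative placeholders.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per GPU, launched e.g. with mpirun or srun

# Pin each process to its local GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy model on synthetic data, standing in for the real network.
images = tf.random_normal([64, 224, 224, 3])
labels = tf.random_uniform([64], maxval=1000, dtype=tf.int32)
logits = tf.layers.dense(tf.layers.flatten(images), 1000)
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

# Scale the learning rate with the number of workers and wrap the optimizer:
# gradients are averaged across ranks with allreduce instead of parameter servers.
opt = tf.train.MomentumOptimizer(learning_rate=0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)
global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

hooks = [
    hvd.BroadcastGlobalVariablesHook(0),   # rank 0 broadcasts the initial weights
    tf.train.StopAtStepHook(last_step=100),
]
with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```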
As the test case, we select the tf_cnn_benchmarks scripts from the TensorFlow project [38] for benchmarking convolutional neural networks. We use a ResNet-50 model [14] with a batch size of 64 and the synthetic image data that the benchmark scripts can generate autonomously. We perform runs from a minimum of 2 nodes up to 512 nodes, increasing the node count in powers of two.
For the native application, we install Horovod 0.15.1 on top of a CSCS-provided build of TensorFlow 1.7.0, which is available on Piz Daint through an environment module. For the container application, we customize the Ubuntu-based Dockerfile provided by Horovod for version 0.15.1 to use TensorFlow 1.7.0 and MPICH 3.1.4. Neither application uses NVIDIA’s NCCL library in place of MPI for any communication operation.
The results are shown in Fig. 3. We measure performance in images per second as reported by the application logs and compute speedup values using the performance averages for each data point, taking the native performance at 2 nodes as baseline.
The container application shows a performance trend identical to the native one, with both versions maintaining close standard deviation values. Absolute performance differences up to 8 nodes are less than \(0.5\%\). From 16 nodes upwards we observe differences of up to \(6.2\%\) at 256 nodes, which are consistent with the variability we observe when running experiments on a non-dedicated system.
COSMO Atmospheric Model Code. The Consortium for Small Scale Modeling (COSMO) [6] develops and maintains a non-hydrostatic, limited-area atmospheric model which is used by several institutions for climate simulations [4, 21] and production-level numerical weather prediction (NWP) [3, 20, 34] on a daily basis.
As test cases, we replicate the strong and weak scaling experiments performed in [13]. These experiments use a refactored version 5.0 of the GPU-accelerated COSMO code to simulate an idealized baroclinic wave [15]. The setup expands the computational domain of COSMO to cover 98.4% of the Earth’s surface and discretizes the vertical dimension using 60 stretched model levels, from the surface up to 40 km. For the complete details of the experimental setup, please refer to [13].
For the native application, we build the COSMO code and its dependencies with the Cray Programming Environment 18.08 available on Piz Daint. For the container application, in order to meet the specific requirements of the COSMO code, we package the Cray Linux Environment 6.0.UP07 and the Cray Developer Toolkit 18.08 in a Docker image. This image is subsequently used as the base environment in which the containerized COSMO stack is built.
Weak scaling. The experiments for weak scaling employ two domain sizes, with \(160 \times 160\) and \(256 \times 256\) horizontal grid points per node, respectively. We perform runs from a minimum of 2 nodes up to 2336 nodes, carrying out 5 repetitions at each node count. For each run, the mean wall-clock time of a simulation step is calculated by taking the total duration of the time loop, as reported by the application log, and dividing it by the number of time steps. Figure 4(a) displays the average, minimum and maximum values of the mean step duration for each data point. The performance trend of the native and container applications is identical across all node counts and domain sizes. The absolute performance differences range from \(0.3\%\) to \(1\%\) on the \(256 \times 256\) per-node domain and from \(1\%\) to \(1.6\%\) on the \(160 \times 160\) per-node domain, which is within the variability observed when running on a shared system.
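For clarity, the metric reduces to the simple computation sketched below; the duration and step count are hypothetical placeholders taken from no particular run.

```python
# Minimal sketch of the mean step duration metric described above.
# Both values are hypothetical placeholders, not taken from our logs.
time_loop_seconds = 812.4  # total duration of the time loop, as reported by the log
num_time_steps = 3600      # number of time steps performed in the run

mean_step_seconds = time_loop_seconds / num_time_steps
print(f"mean wall-clock time per step: {mean_step_seconds:.3f} s")
```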
Strong scaling. The experiments for strong scaling employ a fixed domain with a 19 km horizontal grid spacing. We perform runs from a minimum of 32 nodes up to 2888 nodes, carrying out 5 repetitions at each node count. We measure performance in simulated years per day (SYPD), calculated from the simulation time step and the wall-clock duration of the time loop as reported by the application log. Figure 4(b) displays the average, minimum and maximum values of SYPD for each data point. Again, native and container performance follow very similar trends at all deployment scales and show very small variations. Differences range from \(0.8\%\) to \(2.1\%\). As in the other experiments, such differences are consistent with the variability expected when the system is not dedicated to these runs.
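For reference, SYPD can be derived as in the sketch below, assuming the time step and wall-clock duration are read from the application log; all numbers are hypothetical placeholders.

```python
# Minimal sketch of the SYPD (simulated years per day) metric described above.
# All values are hypothetical placeholders, not taken from our logs.
dt_seconds = 180.0         # simulation time step
num_time_steps = 3600      # number of time steps in the run
time_loop_seconds = 950.0  # wall-clock duration of the time loop, from the log

seconds_per_year = 365.0 * 24.0 * 3600.0
simulated_years = (num_time_steps * dt_seconds) / seconds_per_year
wall_clock_days = time_loop_seconds / (24.0 * 3600.0)

sypd = simulated_years / wall_clock_days
print(f"SYPD = {sypd:.3f}")
```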