Keywords

1 Introduction

Many tools for analyzing the performance of parallel applications exist; one example of them is Score-PFootnote 1 [11], whose development the University of Dresden participates in. It acts as a wrapper which encapsulates the original code, thus can be easily turned on or off by the user at compilation stage. This is illustrated in Fig. 1 below.

Fig. 1.
figure 1

Schematic of software components for performance analysis tools.

As a separate category of add-ons, tools for enabling in situ visualization [5] of applications’ output data (like temperature or pressure in a CFD simulation) already exist too; one example is CatalystFootnote 2 [3]. It also works as an optional add-on to the original code and can be activated upon request, by means of preprocessor directives at compilation stage (Fig. 2).

Fig. 2.
figure 2

Schematic of software components for in situ visualization.

This paper’s goals are two-fold. First, unify the overlapping functionalities of both kinds of tools insofar as they augment a parallel application with additional functionality which is not strictly required for the application to work in the first place. Both collect or “steal” data from the parallel application and transfer it out via a side channel. Second, make use of the advanced visualization functionalities of dedicated visualization software tools for the purpose of performance analysis. With this we propose to map parallel performance properties to the simulation geometry as it is already done for flow-related properties. Figure 3 illustrates the idea.

The high-performance computing (HPC) performance tools usually output either performance profiles or event traces. In the case of Score-P, they are:

  • performance profiles in the Cube4 format to be visualized at CubeFootnote 3 [14];

  • parallel event traces in the OTF2 format to be visualized at VampirFootnote 4 [10].

Fig. 3.
figure 3

Schematic of the software components for a combined add-on.

But neither of them, nor the other currently available performance tools (to be explained in Sect. 2), match their measurements to the original simulation’s geometry; what makes the proposal novel. On the other hand, the proposal is deemed also useful as, especially in CFD applications, the partitioning of the compute mesh for parallelization has direct influence on performance and load balancing. Hence for performance analysis and optimization a combined view into simulation properties and performance properties is helpful.

A design requirement is that the combined solution must be able to be integrated into a parallel code easily, yet without becoming a permanently required component. Instead, it needs to be easy to switch on and off on demand, as it is for each of its constitutive parts. As evaluation case, the Rolls-Royce in-house CFD code (Hydra) [12] will be used.

2 Related Work

Apart from Score-P \(\rightarrow \) Cube and Score-P \(\rightarrow \) Vampir (mentioned above), other workflows – with graphical support – used for performance analysis include:

  • HPCToolkitFootnote 5 [1], whose outputs are visualized through hpcviewer (profiling) and hpctraceviewer (tracing);

  • PeriscopeFootnote 6 [7], whose outputs are visualized through Pathway (an Eclipse-based graphical user-interface);

  • TauFootnote 7 [18], whose outputs are visualized through ParaProf [6] (profiling) and Vampir (tracing), among others;

  • ParaverFootnote 8 [13] and DimemasFootnote 9 [4], with integrated visualization capabilities;

  • SLOG-2, a drawable logging format visualized through JumpshotFootnote 10 [8];

  • ScalascaFootnote 11 [9] as an optional add-on to either Score-P or Tau;

  • READEXFootnote 12 [17], with a bunch of visualization options.

None of them, however, currently match the generated data back to the simulation’s geometry. Furthermore, displaying profiling results on a time-step basis is not straightforward. This paper would like to address those issues.

3 Prerequisites

The goal aimed by this research depends on the combination of two basic, scientifically established methods: performance measurement and in situ processing.

3.1 Performance Measurement

When applied to a source file’s compilation, Score-P automatically inserts probes between each code “region” (mostly function calls, but also constructors, destructors etc.), which will at run-time measure:

  • the number of times that region was executed, and;

  • the total time spent in those executions.

By each rank/thread within the simulation. Its application is done by simply prepending the word scorep into the compilation command, e.g.: scorep mpicc foo.c. The tool is also equipped with an API, which allows the user to extend its functionalities through plugins [15]. The combined solution proposed by this paper takes the form of such a plugin.

3.2 In Situ Processing

In order for Catalyst to interface with the simulation code, an adapter needs to be built, which is responsible for exposing the native data structures (mesh and flow properties) to the coprocessor component. Its interaction with the simulation code happens through three function calls, illustrated in Fig. 4.

Fig. 4.
figure 4

Illustrative example of changes needed in a simulation code due to Catalyst.

Once implemented, the adapter allows the generation of post-mortem files (by means of the VTKFootnote 13 [16] library) and/or the live visualization of the simulation, both through ParaViewFootnote 14 [2].

4 Combining Both Tools

A Score-P plugin has been developed, which allows performance measurements for an arbitrary number of manually selected code regions to be pipelined to the simulation’s Catalyst adapter. It must be activated at run-time through an environment variable (export SCOREP_SUBSTRATE_PLUGINS=Catalyst), but works independently of Score-P’s profiling mode being actually on or off. Figure 5 illustrates the modifications needed in the source.

Fig. 5.
figure 5

Illustrative example of further changes needed in the code due to the plugin.

Apart from the three basic calls (initialize, “run” and finalize; like with the Catalyst adapter), a call must be placed immediately before each function to be pipelined; e.g.:

figure a

The above layout ensures that the desired function will be captured when executed at that specific moment and not in others (if the same routine is called multiple times – with different inputs – throughout the code, as it is usual for CFD simulations). The selected functions may or not be nested.

Finally, the user needs to add a small piece of code into the Catalyst adapter’s source, in order for the plugin-generated variables to be pipelined (together with the traditional simulation variables), as shown in Fig. 6. It contains two vectors because for each selected region inside the simulation’s code, the plugin will generate two variables (which correspond to the two basic measurements made by Score-P; see above).

5 Early Evaluation

5.1 Settings

Hydra is Rolls-Royce’s in-house CFD code [12]. Figure 9 shows the test case selected for this paper: it represents a generic Q3D idealized model for a turbine stage. Preliminary analyses with Score-P \(\rightarrow \) Cube revealed two code functions to be especially time-consuming: iflux_edge and vflux_edge (both mesh-related); they were selected for pipelining.

All simulations were done using an entire node in Dresden University’s HPC cluster (Taurus), with 12 ranks (i.e. pure MPI, no OpenMP), one per core, each with the entire core memory (3875 MB) available. One full engine’s shaft rotation was simulated, comprised of 100 time-steps (i.e. one per 3,6°), each internally converged through 40 iteration steps. Catalyst was generating post-mortem output files every fifth time-step (i.e. every 18°), what led to 20 “stage pictures” by the end of the simulation. Finally, version 4.0 of Score-P was used in association with release 2018a of Intel® compilers.

Fig. 6.
figure 6

Addition needed in the Catalyst adapter’s code due to the plugin.

5.2 Results

Hydra supports multiple mesh partition algorithms, selectable at run-time. We compared them with our newly proposed approach. Figure 7 shows the time spent inside the two chosen functions in two different grid partitions: the upper images refer to geometric mesh partitioning and the lower ones were produced using ParMETIS;Footnote 15 the left-hand side pictures refer to function iflux_edge, whereas the right-hand side to vflux_edge. Here only one time-step is represented, but – as opposed to the traditional way of visualizing profiling results (which aggregate multiple time-steps into one single measurement) – in ParaView it is possible to see each time-step individually and even play them (as frames of a video). Finally, the minimum and maximum thresholds in each of the four pictures’ scales are adjusted to comprise all time-steps.

Fig. 7.
figure 7

Comparison between two code functions in two mesh partitions.

The analysis of the results reveals that, when compared against the geometric mesh partition, using ParMETIS brings slight benefits to the selected functions’ performance: the overall maximum execution time (per time-step) drops in both of them, the overall minimum in vflux_edge; and the max/min ratio of the execution time (per time-step) for both of them is also decreased.

Playing the saved time-steps in ParaView reveals a trend in all four layouts: the slowest/fastest rank to execute each function is always the same. This means there are still load imbalances when using ParMETIS; otherwise, the slowest/fastest rank should randomly change each time-step (due to stochastic phenomena at hardware-level during run-time). See the respective video.

Figure 8 compares the results when profiling is activated (below) or not (above). They let clear that doing simultaneous code profiling significantly slows each region’s execution time, but the max/min ratio remains roughly the same:

  • from \(0.57/0.46 \approx 1.24\) to \(0.75/0.60 = 1.25\) in iflux_edge;

  • from \(0.85/0.69 \approx 1.23\) to \(0.97/0.78 \approx 1.24\) in vflux_edge.

This means the overhead associated with each feature (Score-P’s profiling and/or the plugin) is linear, hence the results are valid from a comparative point of view. Indeed, playing the respective video reveals the same trend (slowest/fastest rank) as in the previous comparison.

Finally, the generated performance variables are accessible also live (interactively) in ParaView. In Fig. 9, notice the “catalyst” icon on the Pipeline Browser, as well as the presence of the selected code regions’ measurements among the Data Arrays.

Fig. 8.
figure 8

Comparison between two code functions when profiling is activated (below) or not (above).

5.3 Overhead

Table 1 analyses the impact of the proposed plugin on the code’s performance. ParMETIS was used for mesh partitioning.

Memory. The “memory” row in Table 1 refers to the peak memory consumption per rank, reached somewhen during the simulation. From the numbers it is clear that the memory overhead introduced by Score-P is negligible (less than 10%); and that the memory overhead introduced by the plugin is also negligible. It may even require less memory than doing the traditional profiling (depending upon the number of code regions being pipelined) and, in our case, was below the statistical margin of oscillation (given profiling + plugin took less memory than profiling only). Indeed, in order to pipeline the two code functions shown above, it was not necessary to increase the default amount of memory (16 MB) that Score-P reserves for itself.

Table 1. Analysis of time and memory overhead of the plugin.
Fig. 9.
figure 9

Geometry used in the simulations.

Time. The run-time overhead is more critical and is shown in Table 1 with two cases. The light-weight instrumentation case shows the overhead of the presented approach with a sensible set of instrumented subroutines as it may have been achieved with carefully selecting the most interesting subroutines for the performance analysis process. This is the suggested way according to the Score-P documentation. In that case, the plugin produces a run-time overhead of 3%. This is less than Score-P in profiling mode with 5%. If both are used together, the overhead adds up. This is a sensible overhead and suitable for practical performance analysis. The second case with heavy-weight instrumentation reflects the worst-case scenario where some short subroutines are called very frequently (several billion times in this example). In that case, the overhead can dominate the entire run-time and the performance analysis insights are not reflecting the pristine parallel performance behavior. However, this scenario in Table 1 shows that our plugin behaves similar to Score-P in profiling mode; actually even slightly better with 71% overhead compared to 79%.

6 Conclusions and Future Work

Visualization techniques are usually not the specialization field of researches working with code performance: it is more reasonable to take advantage of the currently available graphic programs (like ParaView) than attempting – from scratch – to equip the existing profiling tools with their own GUIs. In this threshold, the developed plugin adds to the currently available spectrum of performance optimization resources the capacity to:

  • match performance-related measurements against the simulation’s mesh, what makes the impact of grid partition algorithms on code performance easier to investigate;

  • analyze performance-related measurements on a time-step basis, what makes the load imbalances (and their eventual relationship to mesh size/flow physics) easier to diagnose.

We plan to extend this work in multiple directions:

More Extensive Evaluation Cases. To run the plugin in bigger test cases, as the difficulty in matching each parallel region’s id number with its respective grid part (hence the benefit of matching performance data back to the simulation’s mesh) increases with scaling. Concomitantly, to run the plugin in test cases which comprise regions with distinct flow physics, when the computational load becomes less dependent on the number of points/cells per domain and more dependent on the flow features themselves (given their non-uniform occurrence): chemical reactions in the combustion chamber, shock waves in the inlet/outlet (at the supersonic flow regime), air dissociation in the free-stream/inlet (at the hypersonic flow regime) etc.

Improve and Further Integrate Tool’s Runtime Components. To automatize the selection of code regions to be pipelined, what currently needs to be manually done by the user at compile time (as shown in Sect. 4).

Develop New Visualization Schemes for Performance Data. To take advantage of the multiple filters available in ParaView for the benefit of the performance optimization branch, e.g. by recreating in it the statistical analysis – display of average and standard deviation between the threads/ranks’ measurements – already available in other tools.