In Situ Visualization of Performance-Related Data in Parallel CFD Applications

This paper investigates the feasibility of using ParaView as visualization software for the analysis and optimization of parallel CFD codes' performance. The currently available software tools for reading profiling data do not match the generated measurements to the simulation's original mesh, and they aggregate them in some way (rather than showing them on a time-step basis). A plugin for the open-source performance tool Score-P has been developed, which intercepts an arbitrary number of manually selected code regions (mostly functions) and sends their respective measurements (number of executions and cumulative time spent) to ParaView (through its in situ library, Catalyst), as if they were any other flow-related variable. Results show that (i) the impact of mesh partition algorithms on code performance and (ii) the load imbalances (and their eventual relationship to mesh size/simulation physics) become easier to investigate.


Introduction
Many tools for analyzing the performance of parallel applications exist; one example is Score-P 1 [11], in whose development the University of Dresden participates. It acts as a wrapper that encapsulates the original code and can therefore be easily turned on or off by the user at the compilation stage. This is illustrated in Fig. 1.
As a separate category of add-ons, tools for enabling in situ visualization [5] of applications' output data (like temperature or pressure in a CFD simulation) already exist too; one example is Catalyst 2 [3]. It also works as an optional add-on to the original code and can be activated upon request, by means of preprocessor directives at the compilation stage (Fig. 2).
1 Scalable Performance Measurement Infrastructure for Parallel Codes, an open-source "highly scalable and easy-to-use tool suite for profiling, event tracing, and online analysis of HPC applications": https://www.vi-hps.org/projects/score-p/.
2 An open-source "in situ use case library, with an adaptable application programming interface (API), that orchestrates the delicate alliance between simulation and analysis and/or visualization tasks": https://www.paraview.org/in-situ/.
This paper's goals are twofold. First, to unify the overlapping functionalities of both kinds of tools, insofar as they augment a parallel application with additional functionality which is not strictly required for the application to work in the first place. Both collect or "steal" data from the parallel application and transfer it out via a side channel. Second, to make use of the advanced functionalities of dedicated visualization software for the purpose of performance analysis. With this we propose to map parallel performance properties to the simulation geometry, as is already done for flow-related properties. Figure 3 illustrates the idea.
High-performance computing (HPC) performance tools usually output either performance profiles or event traces. In the case of Score-P, they are:
- performance profiles in the Cube4 format, to be visualized in Cube 3 [14];
- parallel event traces in the OTF2 format, to be visualized in Vampir 4 [10].
But neither of them, nor the other currently available performance tools (to be explained in Sect. 2), matches its measurements to the original simulation's geometry, which is what makes the proposal novel. The proposal is also deemed useful because, especially in CFD applications, the partitioning of the compute mesh for parallelization has a direct influence on performance and load balancing. Hence, for performance analysis and optimization, a combined view of simulation properties and performance properties is helpful. A design requirement is that the combined solution must be easy to integrate into a parallel code, yet without becoming a permanently required component. Instead, it needs to be easy to switch on and off on demand, as is the case for each of its constituent parts. As an evaluation case, the Rolls-Royce in-house CFD code (Hydra) [12] will be used.
None of these tools, however, currently maps the generated data back to the simulation's geometry. Furthermore, displaying profiling results on a time-step basis is not straightforward. This paper addresses both issues.

Prerequisites
The goal of this research depends on the combination of two basic, scientifically established methods: performance measurement and in situ processing.

Performance Measurement
When applied to a source file's compilation, Score-P automatically inserts probes into each code "region" (mostly function calls, but also constructors, destructors etc.), which will measure at run-time, for each rank/thread within the simulation:
- the number of times that region was executed;
- the total time spent in those executions.
It is applied by simply prepending the word scorep to the compilation command, e.g.: scorep mpicc foo.c. The tool is also equipped with an API, which allows the user to extend its functionality through plugins [15]. The combined solution proposed by this paper takes the form of such a plugin.
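The two per-region measurements described above can be pictured with a small conceptual sketch. It is not Score-P's implementation or API; the decorator, the `region_stats` store, and the `flux_kernel` routine are hypothetical names used only to show what "number of executions" and "cumulative time" mean for one region on one rank.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical per-region store: region name -> [visit count, cumulative seconds].
region_stats = defaultdict(lambda: [0, 0.0])


def instrumented(func):
    """Wrap func so that every call updates its visit count and cumulative time."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            stats = region_stats[func.__name__]
            stats[0] += 1                               # one more execution
            stats[1] += time.perf_counter() - start     # add elapsed time


    return wrapper


@instrumented
def flux_kernel(n):
    # Stand-in for a compute-heavy simulation routine (hypothetical).
    return sum(i * i for i in range(n))


for _ in range(3):
    flux_kernel(10_000)

visits, seconds = region_stats["flux_kernel"]
print(f"flux_kernel: {visits} visits, {seconds:.6f} s cumulative")
```

In a parallel run, each rank/thread would hold its own pair of values for every instrumented region.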

In Situ Processing
In order for Catalyst to interface with the simulation code, an adapter needs to be built, which is responsible for exposing the native data structures (mesh and flow properties) to the coprocessor component. Its interaction with the simulation code happens through three function calls, illustrated in Fig. 4.
Once implemented, the adapter allows the generation of post-mortem files (by means of the VTK 13 [16] library) and/or the live visualization of the simulation, both through ParaView 14 [2].
11 A "software tool that supports the performance optimization of parallel programs by measuring and analyzing their runtime behavior": http://www.scalasca.org/.
12 A tool suite that "supports users to improve the energy-efficiency of their HPC applications": https://www.readex.eu/.
13 An open-source "software for manipulating and displaying scientific data": https://www.vtk.org/.
14 An open-source "multi-platform data analysis and visualization application": https://www.paraview.org/.

Combining Both Tools
A Score-P plugin has been developed, which allows performance measurements for an arbitrary number of manually selected code regions to be pipelined to the simulation's Catalyst adapter. It must be activated at run-time through an environment variable (export SCOREP_SUBSTRATE_PLUGINS=Catalyst), but works regardless of whether Score-P's profiling mode is on or off. Figure 5 illustrates the modifications needed in the source. Apart from the three basic calls (initialize, "run" and finalize, as with the Catalyst adapter), a call must be placed immediately before each function to be pipelined, e.g.:
    #ifdef CATALYST_SCOREP
    ! add this region to the list of plugin variables
    CALL cat_sco_pipeline_me()
    #endif
    CALL desired_function(argument_1, argument_2...)
The above layout ensures that the desired function will be captured when executed at that specific moment and not at others (in case the same routine is called multiple times, with different inputs, throughout the code, as is usual for CFD simulations). The selected functions may or may not be nested.
Finally, the user needs to add a small piece of code into the Catalyst adapter's source, in order for the plugin-generated variables to be pipelined (together with the traditional simulation variables), as shown in Fig. 6. It contains two vectors because for each selected region inside the simulation's code, the plugin will generate two variables (which correspond to the two basic measurements made by Score-P; see above).
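A minimal sketch of the "two variables per region" idea follows. It assumes, for illustration only, that each rank paints its own measurement as a constant value over the cells of its mesh part; the function name, the field-naming scheme, and the numbers are hypothetical and do not come from the plugin or the adapter shown in Fig. 6.

```python
def make_region_fields(n_local_cells, local_measurements):
    """Return one constant cell array per measurement for this rank's mesh part.

    local_measurements maps a region name to (visits, cumulative_seconds),
    i.e. the two basic measurements made for each selected region.
    """
    fields = {}
    for region, (visits, seconds) in local_measurements.items():
        # Two arrays per region, so they can be handled like flow variables.
        fields[f"{region}_visits"] = [visits] * n_local_cells
        fields[f"{region}_time"] = [seconds] * n_local_cells
    return fields


# Illustrative values only (not measurements from the paper).
local = {"iflux_edge": (40, 1.25), "vflux_edge": (40, 0.75)}
fields = make_region_fields(n_local_cells=4, local_measurements=local)

for name, values in fields.items():
    print(name, values)
```

With this layout, coloring the mesh by one of these arrays shows at a glance which partition spent the most time in the corresponding region.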

Settings
Hydra is Rolls-Royce's in-house CFD code [12]. Figure 9 shows the test case selected for this paper: it represents a generic Q3D idealized model for a turbine stage. Preliminary analyses with Score-P → Cube revealed two functions to be especially time-consuming: iflux_edge and vflux_edge (both mesh-related); they were selected for pipelining.
All simulations were run on an entire node of Dresden University's HPC cluster (Taurus), with 12 ranks (i.e. pure MPI, no OpenMP), one per core, each with the entire core memory (3875 MB) available. One full rotation of the engine's shaft was simulated, comprising 100 time-steps (i.e. one per 3.6°), each internally converged through 40 iteration steps. Catalyst generated post-mortem output files every fifth time-step (i.e. every 18°), which led to 20 "stage pictures" by the end of the simulation. Finally, version 4.0 of Score-P was used in association with release 2018a of the Intel® compilers.
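The output schedule quoted above can be checked with a few lines of arithmetic. The sketch assumes that output files are written at time-steps 5, 10, ..., 100; the variable names are illustrative.

```python
n_steps = 100          # time-steps per full shaft rotation
inner_iterations = 40  # iteration steps per time-step
output_every = 5       # Catalyst output interval, in time-steps

degrees_per_step = 360.0 / n_steps
degrees_per_output = degrees_per_step * output_every

# Assumption: the first output is written at step 5, the last at step 100.
output_steps = [s for s in range(1, n_steps + 1) if s % output_every == 0]
total_iterations = n_steps * inner_iterations

print(f"{degrees_per_step:.1f} degrees per time-step")
print(f"{degrees_per_output:.1f} degrees between outputs")
print(f"{len(output_steps)} output files, {total_iterations} inner iterations in total")
```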

Results
Hydra supports multiple mesh partition algorithms, selectable at run-time. We compared them using the newly proposed approach. Figure 7 shows the time spent inside the two chosen functions for two different grid partitions: the upper images refer to geometric mesh partitioning and the lower ones were produced using ParMETIS 15; the left-hand side pictures refer to function iflux_edge, whereas the right-hand side ones refer to vflux_edge. Only one time-step is represented here but, as opposed to the traditional way of visualizing profiling results (which aggregates multiple time-steps into one single measurement), in ParaView it is possible to see each time-step individually and even to play them (as frames of a video). Finally, the minimum and maximum thresholds of the scales in each of the four pictures are adjusted to cover all time-steps.
The analysis of the results reveals that, compared with the geometric mesh partition, using ParMETIS brings slight benefits to the selected functions' performance: the overall maximum execution time (per time-step) drops in both of them, as does the overall minimum in vflux_edge, and the max/min ratio of the execution time (per time-step) also decreases for both.
Playing the saved time-steps in ParaView reveals a trend in all four layouts: the slowest/fastest rank to execute each function is always the same. This means there are still load imbalances when using ParMETIS; otherwise, the slowest/fastest rank would change randomly at each time-step (due to stochastic phenomena at hardware level during run-time). See the respective video. Figure 8 compares the results when profiling is activated (below) or not (above). They make clear that doing simultaneous code profiling significantly slows down each region's execution, but the max/min ratio remains roughly the same. This means the overhead associated with each feature (Score-P's profiling and/or the plugin) is linear, hence the results are valid from a comparative point of view. Indeed, playing the respective video reveals the same trend (slowest/fastest rank) as in the previous comparison.
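The two indicators used in this discussion, the per-time-step max/min ratio and the persistence of the slowest/fastest rank, can be sketched as follows. The timings are synthetic and chosen only to illustrate the computation; they are not results from the paper, and the function names are hypothetical.

```python
def imbalance_per_step(times):
    """times[step][rank] -> list of (max/min ratio, slowest rank, fastest rank)."""
    summary = []
    for per_rank in times:
        ranks = range(len(per_rank))
        slowest = max(ranks, key=per_rank.__getitem__)
        fastest = min(ranks, key=per_rank.__getitem__)
        summary.append((per_rank[slowest] / per_rank[fastest], slowest, fastest))
    return summary


def persistent_ranks(summary):
    """Return (slowest_is_constant, fastest_is_constant) over all time-steps."""
    slowest = {s for _, s, _ in summary}
    fastest = {f for _, _, f in summary}
    return len(slowest) == 1, len(fastest) == 1


# Synthetic per-rank execution times (seconds) for three time-steps, four ranks.
times = [
    [1.0, 1.5, 2.0, 1.25],
    [1.0, 1.5, 2.0, 1.5],
    [1.0, 1.75, 2.0, 1.5],
]

summary = imbalance_per_step(times)
same_slowest, same_fastest = persistent_ranks(summary)

for step, (ratio, slow, fast) in enumerate(summary):
    print(f"step {step}: max/min = {ratio:.2f}, slowest rank {slow}, fastest rank {fast}")
print("persistent slowest rank:", same_slowest, "| persistent fastest rank:", same_fastest)
```

A slowest rank that never changes across time-steps, as in this synthetic data, is the signature of a systematic load imbalance rather than of random run-time noise.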
Finally, the generated performance variables are also accessible live (interactively) in ParaView. In Fig. 9, notice the "catalyst" icon in the Pipeline Browser, as well as the presence of the selected code regions' measurements among the Data Arrays. Table 1 analyses the impact of the proposed plugin on the code's performance. ParMETIS was used for mesh partitioning.

Memory.
The "memory" row in Table 1 refers to the peak memory consumption per rank, reached at some point during the simulation. From the numbers it is clear that the memory overhead introduced by Score-P is negligible (less than 10%), and that the memory overhead introduced by the plugin is also negligible. The plugin may even require less memory than traditional profiling (depending on the number of code regions being pipelined) and, in our case, its overhead was below the statistical margin of oscillation (given that profiling + plugin took less memory than profiling only). Indeed, in order to pipeline the two functions shown above, it was not necessary to increase the default amount of memory (16 MB) that Score-P reserves for itself.

Time.
The run-time overhead is more critical and is shown in Table 1 for two cases. The light-weight instrumentation case shows the overhead of the presented approach with a sensible set of instrumented subroutines, as may be achieved by carefully selecting the most interesting subroutines for the performance analysis process. This is the way suggested by the Score-P documentation. In that case, the plugin produces a run-time overhead of 3%, which is less than the 5% of Score-P in profiling mode. If both are used together, the overheads add up. This is an acceptable overhead and suitable for practical performance analysis. The second case, with heavy-weight instrumentation, reflects the worst-case scenario where some short subroutines are called very frequently (several billion times in this example). In that case, the overhead can dominate the entire run-time and the performance analysis insights no longer reflect the pristine parallel performance behavior. However, this scenario in Table 1 shows that our plugin behaves similarly to Score-P in profiling mode, and actually even slightly better, with 71% overhead compared to 79%.
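For reference, the relative run-time overhead discussed above is the extra wall-clock time divided by the uninstrumented time. The sketch below uses hypothetical wall-clock values, picked only so that the resulting percentages match the 3% and 5% quoted in the text; they are not the timings of Table 1.

```python
def overhead_percent(baseline_seconds, instrumented_seconds):
    """Relative run-time overhead of an instrumented run, in percent."""
    return 100.0 * (instrumented_seconds - baseline_seconds) / baseline_seconds


# Hypothetical wall-clock times in seconds (not taken from Table 1).
baseline = 200.0
runs = {"plugin only": 206.0, "profiling only": 210.0}

overheads = {name: overhead_percent(baseline, t) for name, t in runs.items()}
for name, value in overheads.items():
    print(f"{name}: {value:.1f}% overhead")
```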

Conclusions and Future Work
Visualization techniques are usually not the field of specialization of researchers working on code performance: it is more reasonable to take advantage of the currently available graphics programs (like ParaView) than to attempt, from scratch, to equip the existing profiling tools with their own GUIs. In this context, the developed plugin adds to the currently available spectrum of performance optimization resources the capacity to:
- match performance-related measurements against the simulation's mesh, which makes the impact of grid partition algorithms on code performance easier to investigate;
- analyze performance-related measurements on a time-step basis, which makes the load imbalances (and their eventual relationship to mesh size/flow physics) easier to diagnose.
We plan to extend this work in multiple directions:

More Extensive Evaluation Cases.
To run the plugin in bigger test cases, as the difficulty in matching each parallel region's id number with its respective grid part (hence the benefit of matching performance data back to the simulation's mesh) increases with scaling. Concomitantly, to run the plugin in test cases which comprise regions with distinct flow physics, when the computational load becomes less dependent on the number of points/cells per domain and more dependent on the flow features themselves (given their non-uniform occurrence): chemical reactions in the combustion chamber, shock waves in the inlet/outlet (at the supersonic flow regime), air dissociation in the free-stream/inlet (at the hypersonic flow regime) etc.

Improve and Further Integrate Tool's Runtime Components.
To automate the selection of code regions to be pipelined, which currently needs to be done manually by the user at compile time (as shown in Sect. 4).

Develop New Visualization Schemes for Performance Data.
To take advantage of the multiple filters available in ParaView for the benefit of performance optimization, e.g. by recreating in it the statistical analysis (display of average and standard deviation across the threads'/ranks' measurements) already available in other tools.
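The statistical display mentioned here boils down to an average and a standard deviation across ranks for each time-step. The sketch below is only an illustration of that computation with Python's standard library; the function name and the sample timings are hypothetical, and it does not use any ParaView filter.

```python
from statistics import mean, pstdev


def rank_statistics(per_rank_times):
    """Return (average, standard deviation, per-rank deviation in units of sigma)."""
    avg = mean(per_rank_times)
    sd = pstdev(per_rank_times)  # population standard deviation across ranks
    z_scores = [(t - avg) / sd if sd else 0.0 for t in per_rank_times]
    return avg, sd, z_scores


# Synthetic per-rank times (seconds) for one region and one time-step.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
avg, sd, z_scores = rank_statistics(sample)

print(f"average = {avg:.2f} s, standard deviation = {sd:.2f} s")
for rank, z in enumerate(z_scores):
    print(f"rank {rank}: {z:+.2f} sigma from the average")
```

Mapped onto the mesh, the per-rank deviation would highlight the partitions that are furthest from the average load.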