1 Introduction

The explosive growth of computing power and the development of simulation technologies have enabled fast, large-scale, and high-fidelity computational fluid dynamics (CFD) simulations, which are widely used in various scientific and engineering fields. In large-scale CFD simulations, interactive visualization of simulation data followed by exploration of simulation parameters has become increasingly difficult because of the increase in the simulation data size and the computational cost. In situ steering is a promising solution to this issue. This approach visualizes and controls simulations on supercomputers at runtime without re-executing them and, thus, can greatly reduce users’ efforts in debugging simulation codes and exploring simulation parameters. In addition, the human-in-the-loop approach enables intuitive feedback to complicated simulations, which is useful in exploring optimal solutions and analyzing inverse problems without vast parameter scans. For example, in the plume dispersion analysis code CityLBM, which is the target of this study, we solve inverse problems in emergencies such as nuclear terrorism or accidents to estimate pollutant sources based on the plume dispersion analysis and the monitoring data.

The first requirement in in situ steering is the capability of real-time in situ visualization. In recent supercomputers, I/O speed cannot keep up with accelerated computation; thus, in situ visualization has become essential to avoid the I/O bottleneck. Moreover, state-of-the-art exascale supercomputers with multi-core CPUs and GPUs have limitations on the communication bandwidth and the memory size. Thus, the following issues have become apparent in the conventional polygon-based in situ visualization, which renders simulation data on supercomputers.

  1. Interactive changes of visualization parameters such as transfer functions and camera positions are difficult, because any change requires processing visualization on supercomputers.

  2. The data size of polygons generated over the whole volume data is large, leading to insufficient memory and enormous costs for data transfer.

  3. The domain decomposition used in the simulation results in costly global communication for visibility ordering of polygon data.

We resolved these problems by developing the in situ particle-based volume rendering framework, IS-PBVR (Kawamura et al. 2017a; Kawamura and Idomura 2020). IS-PBVR is tightly coupled with the simulation and supports client/server type interactive remote visualization between a supercomputer that processes the simulation and a user PC that displays the visualization results. The server program on the supercomputer is optimized on many-core CPUs using PBVR that does not involve global communication and rapidly compresses large volume data into megabyte-scale visualization particle data. This results in low computational and memory costs on the supercomputer, low data transfer costs, and fast rendering on the user PC. IS-PBVR transfers visualization parameters and the particle data between the client program on the user PC and the server program on the supercomputer via the storage on the supercomputer, which works on supercomputers based on various architectures.

In this work, we extend IS-PBVR to an interactive in situ steering framework that generally works for large-scale CFD simulations on GPU supercomputers. To avoid interference with the simulation on the GPU, the simulation data on the GPU is transferred to the host CPU, where the in situ visualization is processed. This is an intermediate implementation between the loosely coupled model where visualization is processed on different computation nodes and the tightly coupled model where visualization is processed on the GPU. Additionally, we extend PBVR to block structured AMR data, which is generated from CityLBM. As visualization functions, we implement GUIs that present three-dimensional (3D) volume data by PBVR and time-series data at monitoring points by 2D graphs. In situ steering functions are extended by adding the simulation parameters and the monitoring data to the communication protocol of IS-PBVR.

In the numerical experiments, we apply the developed framework to CityLBM on the GPU supercomputer SGI8600 and demonstrate that in situ visualization and steering are possible without interfering with the real-time plume dispersion analysis. Using the in situ steering capability, we address an inverse problem where the pollutant source is estimated from the plume dispersion analysis and the monitoring data. Our contributions in this work are listed in the following.

  1. IS-PBVR is extended to efficiently visualize block structured AMR data of multi-scale CFD simulations on GPU supercomputers.

  2. An in situ steering framework is constructed by developing asynchronous in situ steering functions supported by in situ data monitoring and by extending the communication protocol and GUI in IS-PBVR.

  3. IS-PBVR achieved real-time visualization and steering for the plume dispersion analysis with over 100 million grids running on the GPU supercomputer SGI8600.

The remainder of the paper is organized as follows: Sect. 2 outlines related work. Section 3 describes IS-PBVR, which is the basis for the proposed in situ steering framework. Section 4 describes the extension of IS-PBVR to GPU supercomputers, and Sect. 5 presents the design of the in situ steering framework. Section 6 introduces CityLBM, which is used in the numerical experiments, and Sect. 7 demonstrates the effectiveness and the computational performance of the proposed method.

2 Related work

Since the 1990s, there have been efforts to realize computational steering by integrating analysis and visualization into the simulation workflow (Rowlan et al. 1994). VASE (Jablonowski et al. 1993), SCIRun (Parker et al. 1995; Johnson et al. 1999), and CUMULVS (Geist et al. 1997) were early computational steering frameworks. VASE is a visualization and steering framework designed to work with existing Fortran codes. VASE defined abstract steering functions as subroutines and provided a tool to create an application control workflow by managing those subroutines. SCIRun is a programming environment that employs an object oriented dataflow programming model. SCIRun can interactively control the simulation by varying the boundary conditions, geometry, and simulation parameters. This steering function is tightly coupled to the simulation and is executed when the simulation is paused. CUMULVS is a framework that allows multiple users to monitor and steer the simulation remotely. It not only provides access to the distributed data at runtime, but also supports checkpoint-based fault tolerance. It also uses the visualization software AVS to display the results of 3D data visualization.

Beazley et al. proposed a workflow-based steering approach using a scripting language of the kind employed in commercial software such as MATLAB, IDL, and Mathematica (Beazley and Lomdahl 1996). A steerable workflow was built for the large-scale 3D molecular dynamics simulation code SPaSM, and MATLAB was used for visualization. Harrop et al. developed a steering tool called TierraLab, which is an extension of a seismic tomography code, on a MATLAB interface (Harrop et al. 1998). Modi et al. proposed a lightweight steering system, POSSE, based on a client and server programming model and demonstrated the utility of the system in wake-vortex simulations of multiple aircraft running on a Beowulf cluster (Modi et al. 2002).

In the 2000s, in situ visualization and steering of large-scale simulations on supercomputers were reported. Tu et al. proposed in situ steering in Hercules, an integrated simulation framework for terascale earthquake simulations on the Cray XT3 (Tu et al. 2006a, b). Hercules was developed based on a function to interact with compute nodes from the outside, and demonstrated real-time visualization and steering over a wide area network. Matthes et al. proposed ISAAC, a C++ template library that allows visualization on GPUs and runs a server program on a login node to stream rendered images (Matthes et al. 2016). The library can send user-defined steering commands asynchronously to the running application. ISAAC works on various computing platforms including GPUs. In numerical experiments, in situ visualization of particle-in-cell simulations was performed using 4096 GPUs.

In 2011, libraries and frameworks to support large-scale in situ visualization and steering were released for the two general-purpose visualization tools, ParaView and VisIt. ICARUS is a ParaView plug-in for in situ visualization (Biddiscombe et al. 2011). ICARUS uses a virtual file driver for HDF5 for data access and runs in a loosely coupled environment. The steering simulation parameters are passed using a shared file interface. ParaView co-processing, called Catalyst, is a visualization library that extends ParaView and is tightly coupled with the simulation to enable in situ visualization and steering using the visualization functions of ParaView (Fabian et al. 2011). Libsim is a simulation and visualization library based on VisIt (Whitlock et al. 2011). Libsim is tightly coupled to the simulation and pauses the simulation during data manipulation. VisIt can combine various compute engines to build a visualization process. VisIt can also incorporate the simulation as one of the compute engines to handle in situ visualization and steering on a common interface. Rivi et al. reported a case study of in situ visualization and steering using VisIt Libsim and ICARUS (Rivi et al. 2012). VisIt Libsim was tightly coupled to the BrainCore code, which simulates a network model of the brain, and ICARUS was tightly coupled to the astrophysics code PLUTO, to show examples of restart control and steering of simulations. Yi et al. steered a large-scale turbulence simulation in a complex geometry of subchannels in a nuclear reactor (Yi et al. 2014). The steering environment was built by incorporating ParaView Catalyst into the finite element solver PHASTA, and the bubble flow rate was adjusted by steering the pressure gradient. Performance measurements on the Cray XK7 Titan showed that the cost of the in situ visualization increased from 3.4 to 10.3% of the simulation when the number of cores was varied from 2048 to 32,768. Additionally, by varying the number of cores from 4096 to 32,768 on the IBM BlueGene/Q Mira, the cost of the in situ visualization increased from 8.8 to 32.2% of the simulation. Buffat et al. demonstrated in situ visualization and steering with a loosely coupled approach using 2048 cores for direct numerical simulations of channel turbulence and 512 cores for visualization with VisIt (Buffat et al. 2017). The cost of image generation by VisIt was equivalent to 20 time steps, and the visualization process was executed every 50 iterations. In situ steering of the simulation parameters was achieved by editing the Python script referenced by VisIt. Although the above works addressed in situ steering of extreme scale CFD simulations, their in situ visualization functions were limited to isosurfaces and slices, while volume rendering is well known as one of the most efficient visualization approaches for CFD simulations. Yi et al. discussed that the use of transparency in the co-processing pipeline led to data redistribution among computing cores in order to create a visibility reordering, which did not scale well to a large number of cores and resulted in prohibitive performance overhead when running on Titan with 18,432 cores (Yi et al. 2014). In this work, we resolve this bottleneck using PBVR, and enable volume rendering and the use of transparency, which can efficiently support in situ steering without interfering with the simulation.

3 In situ visualization as a basis for interactive steering

3.1 Requirements for in situ visualization

To build a framework that enables in situ steering for real-time CFD simulations optimized on GPU supercomputers, the costs associated with interactive processing and the capabilities of visualization analysis must meet the following requirements.

  1. A client/server type distributed processing model that enables interactive and real-time changes of visualization parameters such as transfer functions and camera positions.

  2. Massively parallel visualization at sufficiently low computational and memory costs that do not interfere with the simulation.

  3. Flexible visualization functions that enable synthesis and visualization of multi-variate volume data.

Although various in situ visualization and steering techniques have been developed, only a few frameworks fully satisfy the above requirements. In this section, we review IS-PBVR from the above viewpoints.

3.2 Overview of PBVR

IS-PBVR is an interactive in situ visualization framework developed based on particle-based volume rendering (PBVR) (Sakamoto et al. 2010; Kawamura et al. 2010). PBVR is a projection-based volume rendering method that calculates pixel values by converting volume data into particle data and projecting it onto the screen.

Volume rendering is generally based on the density emitter model (Sabella 1988; Williams and Max 1992), in which opaque, tiny, and self-emitting particles are distributed inside the volume element. In this model, an alpha value is defined as the light attenuation calculated from the light shielding rate at the volume element. In traditional volume rendering methods, the interaction of light within the volume element is approximated by alpha blending, and the pixel value is calculated by repeating alpha blending in order along the line of sight. However, PBVR directly approximates the density emitter model using opaque particles generated by the Monte Carlo method.

PBVR generates the particle distribution based on a particle density derived from the opacity. The alpha blending along the line of sight is then computed approximately by projecting the particles onto the screen while taking particle occlusion into account. If the particle distribution is sufficiently random, the light shielding rate at the volume element is determined only by the local particle density and does not change with the viewing direction. Since the particle data is view-independent, there is no need to regenerate particles when the camera position changes.

In the particle generation stage, the probability density function of the Monte Carlo method is defined by the particle density which is calculated from opacity. The particle data consists of position, color, and a normal vector. Since only the particles necessary for calculating pixel values are generated, the particle data is compressed to tens to hundreds of megabytes, regardless of the size of the original volume data. Additionally, the same algorithm can be used for structured, unstructured, and block structured AMR grids by simply changing the interpolation function, since the particles are generated in an element by element manner.

In the rendering stage, the particles are projected onto a screen buffer where each pixel is subdivided into sub-pixels. The pixels on the image buffer are divided into sub-pixels in order to calculate the projection of particles smaller than a pixel. The number of required particles increases with the number of pixel divisions s (sub-pixel level), thus refining the image. The sub-pixel processing produces an image buffer with an s² times higher resolution, which puts pressure on GPU memory. A repeat process has been developed that achieves the same effect as the sub-pixel processing by dividing the same particle data into s² subsets, repeating the projection s² times, and finally superimposing the images. Sakamoto et al. (2010) showed that by increasing the number of particles, which is proportional to s², the image quality improves and approaches that of traditional volume rendering.
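As a concrete picture of the repeat process, the following sketch shows how the effect of sub-pixel processing can be obtained by projecting s² particle subsets onto a normal-resolution buffer and averaging the resulting images. This is an illustrative reimplementation, not the actual IS-PBVR rendering code; the Particle and Image types and the stubbed projectSubset function are assumptions.

```cpp
#include <cstddef>
#include <vector>

// Illustrative types only; these are not the actual IS-PBVR data structures.
struct Particle { float x, y, z; unsigned char r, g, b; };
struct Image {
  int width, height;
  std::vector<float> rgb;  // width * height * 3, accumulated as floats
};

// Projects one particle subset onto a normal-resolution buffer with per-pixel
// depth testing of the opaque particles (the nearest particle wins each pixel).
// The actual rasterization and camera model are omitted in this sketch.
Image projectSubset(const std::vector<Particle>& subset, int width, int height) {
  Image img{width, height, std::vector<float>(3u * width * height, 0.0f)};
  // ... rasterize "subset" into img with a depth buffer ...
  (void)subset;
  return img;
}

// Repeat process: the particle data is split into s^2 statistically similar
// subsets, each subset is projected separately, and the s^2 resulting images
// are averaged. This reproduces the effect of sub-pixel level s without
// allocating an s^2 times larger image buffer.
Image renderByRepetition(const std::vector<Particle>& particles,
                         int s, int width, int height) {
  const std::size_t nsub = static_cast<std::size_t>(s) * s;
  Image acc{width, height, std::vector<float>(3u * width * height, 0.0f)};
  for (std::size_t k = 0; k < nsub; ++k) {
    std::vector<Particle> subset;
    for (std::size_t i = k; i < particles.size(); i += nsub)
      subset.push_back(particles[i]);               // every nsub-th particle
    const Image img = projectSubset(subset, width, height);
    for (std::size_t p = 0; p < acc.rgb.size(); ++p) acc.rgb[p] += img.rgb[p];
  }
  for (float& v : acc.rgb) v /= static_cast<float>(nsub);  // ensemble average
  return acc;
}
```

Superimposing s² normal-resolution images keeps the buffer size constant on the rendering side, which is the point of the repeat process.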

3.3 Visualization functions in IS-PBVR

IS-PBVR is equipped with a flexible design interface for multi-dimensional transfer functions that enables various visualization analyses of multi-variate volume data (Kawamura et al. 2017b). Users can design a multi-dimensional transfer function as a generic function of multiple 1D transfer functions by using algebraic expressions based on their mathematical and physical images of the multi-dimensional data space.

The transfer function editor enables users to define two types of algebraic expressions: one for generating new variables from multiple variables, and the other for synthesizing 1D transfer functions into a multi-dimensional transfer function. The new variables, which are defined as arbitrary functions of multiple variables and coordinates (x, y, z) in the simulation, are computed for each element and particle position at runtime during in situ visualization. Since this implementation does not require copying the volume data, low-memory in situ data analysis is possible. Algebraic expressions are described using the four arithmetic operations and basic functions including logarithmic, trigonometric, and staircase functions. Additionally, differential operators in (x, y, z) are available for computing derivatives of variables. These algebraic expressions are calculated using a mathematical function parser optimized with SIMD operations on many-core CPUs (Kawamura and Idomura 2020).

The multi-dimensional transfer function includes general visualization functions such as isosurfaces, slice-surfaces, and clipping. Isosurfaces are generated by volume rendering for the opacity function, which has localized peaks at particular physical values. On the other hand, slice-surfaces and clipping are generated by giving peaks and a segment to the opacity function for each coordinate, respectively. It is also possible to synthesize multiple visualization objects by summing their transfer functions.
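For example, an isosurface at a value v0 can be expressed, schematically and not in the exact IS-PBVR notation, by an opacity transfer function with a narrow peak such as α(v) = α0 exp(−(v − v0)²/(2σ²)), while a slice at y = y0 corresponds to a similar peak of the opacity in the y coordinate and clipping to a segment of constant opacity. Summing such opacity functions then yields a synthesized object containing both the isosurface and the slice.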

3.4 CS-PBVR

The conventional client/server visualization using polygon-based visualization techniques has bottlenecks in terms of massively parallel visualization and data transfer. Volume rendering with polygons requires visibility ordering for alpha blending, which requires costly communication between the computation nodes on the supercomputer and reduces the scalability of parallel visualization. Additionally, it is necessary to transfer large visualization data consisting of grids or polygons to the client PC at each change of visualization parameters. Since the data compression rate of polygon-based visualization data is low, it is difficult to apply this approach to terabyte-scale volume data.

A client/server remote visualization framework, CS-PBVR (Kawamura et al. 2014), was built to visualize large volume data on a remote supercomputer (Fig. 1, top). This framework is based on a client/server type distributed processing model, in which particle generation and projection of PBVR are, respectively, processed on the supercomputer and the user PC. CS-PBVR consists of two components, Server and Client, which run on the interactive node of the supercomputer and the user PC, respectively. Server processes particle generation and data transfer using a master/slave MPI parallelization model. Here, the domain decomposition model and the MPI communicator are unchanged from those in the simulation. The slave processes handle particle generation for the decomposed domains (MPI), where each element is processed in parallel by threads (OpenMP). On the other hand, the master process collects the generated particle data and transfers it to Client. In conventional parallel volume rendering, visibility ordering for alpha blending causes costly inter-node communication, resulting in performance degradation. In CS-PBVR, Client processes particle projection, including visibility ordering, on the graphics processor of the user PC, so parallel visualization on Server becomes almost embarrassingly parallel and shows ideal scalability. Since the particle data is megabyte-scale and independent of the viewpoint, the amount of data transfer is small, and interactive changes of camera positions are easily processed on the user PC. In the transfer function editor, statistical information of the variables assigned to each 1D transfer function is calculated, and minimum and maximum values and histograms are shown on the user interface.
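The master/slave data collection in Server can be pictured as a variable-length gather of the locally generated particle buffers. The following is a minimal sketch under that assumption; the buffer layout and function name are illustrative and not the actual CS-PBVR implementation.

```cpp
#include <mpi.h>
#include <vector>

// Gather variable-length particle buffers from all slave processes onto the
// master (rank 0), which then sends the concatenated data to Client.
std::vector<float> gatherParticles(const std::vector<float>& local, MPI_Comm comm) {
  int rank = 0, size = 0;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  // 1) Collect the local buffer sizes on the master.
  int localCount = static_cast<int>(local.size());
  std::vector<int> counts(rank == 0 ? size : 0);
  MPI_Gather(&localCount, 1, MPI_INT, counts.data(), 1, MPI_INT, 0, comm);

  // 2) Compute displacements and gather the particle data itself.
  std::vector<int> displs;
  std::vector<float> all;
  if (rank == 0) {
    displs.resize(size, 0);
    for (int i = 1; i < size; ++i) displs[i] = displs[i - 1] + counts[i - 1];
    all.resize(static_cast<std::size_t>(displs[size - 1]) + counts[size - 1]);
  }
  MPI_Gatherv(local.data(), localCount, MPI_FLOAT,
              all.data(), counts.data(), displs.data(), MPI_FLOAT, 0, comm);
  return all;  // non-empty only on the master
}
```

Because each slave only sends its own particle buffer and no visibility ordering is needed on the server side, the communication pattern remains a simple gather regardless of the number of processes.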

Fig. 1

Components and data flows of CS-PBVR (top), IS-PBVR (middle), and in situ Steering (bottom). Across the vertical line representing the internet, the supercomputer consisting of the interactive node, the storage, and the computation nodes is shown on the left, while the user PC is on the right. The components for visualization are represented by the gray boxes. The blue arrows indicate the monitoring data and the particle data, and the red arrows illustrate the simulation parameters and the visualization parameters

3.5 IS-PBVR

The in situ visualization has the advantage of avoiding the I/O bottleneck, but interactive visualization was difficult for batch processing simulations on supercomputers. We extended CS-PBVR to build an interactive in situ visualization framework IS-PBVR (Fig. 1, middle). IS-PBVR consists of three components: Sampler, Daemon, and Client, which run on the computation nodes, the interactive node, and the user PC, respectively. Here, Sampler and Daemon, respectively, correspond to particle generation and data transfer parts of Server in CS-PBVR. Sampler, which is tightly coupled to the simulation, loads a visualization parameter file, generates particle data, and outputs distributed files of the particle data on the storage at each time step. Daemon, which is a key component in the interactive process, communicates with Client via socket communication and with Sampler via files on the storage. Daemon collects the particle data files on the storage and sends the collected data to Client. It also receives visualization parameters from Client, and updates the visualization parameter file on the storage. This implementation enables an asynchronous and interactive file-based control of in situ visualization. On recent supercomputers, the storage is shared between the interactive and computation nodes, and socket communication to the interactive node via ssh tunnels is supported. Therefore, the above interactive visualization approach works on most supercomputers.

IS-PBVR has been developed as a C++ class library. It supports both structured and unstructured grids. The simulation and Sampler are tightly coupled, and visualization functions in the library are called inside the time step loop of the simulation. Since, as explained in Sect. 3.4, PBVR can be processed under the same domain decomposition and MPI communicator as the simulation, in situ visualization can be easily implemented by inserting the particle generation function, ParticleGeneration(grid, value, geom), into the time step loop. Here, “grid”, “value”, and “geom” are structure data generated from the simulation. “grid” is the grid data (including connectivity information in the case of the unstructured grid). “value” is the simulation data consisting of an arbitrary number of variables, in either array-of-structures or structure-of-arrays format. “geom” includes the number of elements, the minimum and maximum values of the coordinates, and the halo information (only for domain decomposition of the structured grid). The library can be used not only from C++ but also from C and Fortran through the corresponding wrapper interfaces.
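A minimal sketch of this coupling is shown below. Only the call pattern and the roles of the arguments follow the description above; the structure members are placeholders and the surrounding simulation code is hypothetical.

```cpp
// Illustrative coupling of a simulation time step loop with IS-PBVR.
// The structure members below are placeholders; the actual layouts are
// defined by the IS-PBVR library.
struct Grid  { /* coordinates, and connectivity for unstructured grids */ };
struct Value { /* simulation variables in AoS or SoA layout */ };
struct Geom  { /* number of elements, coordinate min/max, halo information */ };

// Provided by the IS-PBVR library (declaration only; linked from the library).
void ParticleGeneration(const Grid& grid, const Value& value, const Geom& geom);

void runSimulation(int nsteps, int visInterval) {
  Grid grid; Value value; Geom geom;
  for (int step = 0; step < nsteps; ++step) {
    // ... advance the simulation and update "value" under the existing
    //     domain decomposition and MPI communicator ...
    if (step % visInterval == 0) {
      // Replaces the raw volume data output: compact particle files are
      // written to the storage for Daemon to collect.
      ParticleGeneration(grid, value, geom);
    }
  }
}
```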

IS-PBVR was applied to the in situ visualization of multi-variate data in severe accident analysis of nuclear reactors using 1536 KNL processors on the Oakforest-PACS (Kawamura and Idomura 2020). The computational cost and the memory usage of Server were, respectively, suppressed to 0.3% and 3% of the simulation. Here, the remaining particle data I/O was completely masked by task parallel implementation between the simulation and the I/O thread using C++ STL containers, and excellent strong scaling was achieved up to 98,304 cores.

IS-PBVR enabled low cost massively parallel visualization on many-core CPU supercomputers, interactive visualization via the file-based control, and flexible visualization functions for multi-variate data. Therefore, IS-PBVR satisfies all the requirements for in situ steering.

4 Extension of IS-PBVR

4.1 Block structured AMR data

The particle generation of PBVR can be processed for each element independently. This feature was used to optimize the particle generation on massively parallel supercomputers. In this work, we extend the particle generation to the block structured AMR grid used in CityLBM.

In the block structured AMR grid, the computational domain is divided by an octree structure, and the grid resolution is varied depending on the scale of flows in each region. Here, finer grids are generated by repeating octree-based subdivision of the subdomain, and the grid level is defined by the number of subdivisions. Each subdomain, called a leaf, consists of N³ Cartesian grids, which enables highly efficient GPU computation with contiguous memory access. The particle generation function is extended to handle the block structured AMR grid with the same interface as the structured and unstructured grids. The data format of the structured grid is extended from 3D to 4D with N³ × L, where L is the number of leaves on each MPI process. The structure data “grid” is extended by adding the position of each leaf.

In CityLBM, the physical values are defined at the center of each element, and the values at the grid vertices are calculated by averaging the values of the surrounding elements. This center-vertex conversion from N³ × L grids to (N + 1)³ × L grids is performed on the GPU supercomputer. In the particle generation, each leaf is processed as (N + 1)³ Cartesian grids inside a loop of length L. For the particle generation in the leaf, the number of particles is calculated by multiplying the volume of the element by the particle density, and the value and the gradient at each particle position are calculated by trilinear interpolation within the element. Here, the Jacobian is calculated by taking into account the grid size at each grid level.

The particle generation is parallelized using MPI without changing the domain decomposition used in the simulation. In each subdomain, the loop of length L is parallelized by OpenMP, and SIMD optimization is applied to the (N + 1)³ Cartesian grids.
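The structure of this leaf-parallel particle generation can be sketched as follows. The Leaf layout and the helper functions cellSize and emitParticles are assumptions made for illustration; in particular, the per-element particle count actually depends on the opacity evaluated from the transfer function, which is hidden inside emitParticles here.

```cpp
#include <vector>

// Illustrative leaf data after the center-vertex conversion: (N+1)^3 vertex
// values per leaf. This is not the actual IS-PBVR data structure.
struct Leaf {
  float origin[3];            // position of the leaf
  int   level;                // grid level (number of octree subdivisions)
  std::vector<float> vertex;  // (N+1)^3 vertex values of one variable
};

// Assumed helper: grid spacing halves at each level (base spacing of 1).
inline float cellSize(int level) { return 1.0f / static_cast<float>(1 << level); }

// Assumed helper: Monte Carlo particle sampling with trilinear interpolation
// inside element (i, j, k) of the given leaf (details omitted).
inline void emitParticles(const Leaf&, int, int, int, float /*expectedCount*/) {}

void generateParticlesAMR(const std::vector<Leaf>& leaves, int N,
                          float particleDensity) {
  // The MPI decomposition is unchanged from the simulation; within each
  // subdomain the loop over the L leaves is parallelized with OpenMP.
  #pragma omp parallel for schedule(dynamic)
  for (long l = 0; l < static_cast<long>(leaves.size()); ++l) {
    const Leaf& leaf = leaves[l];
    const float dx = cellSize(leaf.level);   // Jacobian depends on the grid level
    const float cellVolume = dx * dx * dx;
    for (int k = 0; k < N; ++k)
      for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)          // SIMD-friendly innermost loop
          emitParticles(leaf, i, j, k, particleDensity * cellVolume);
  }
}
```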

Many visualization applications, including ParaView, do not support direct volume rendering of block structured AMR data and convert it to unstructured hexahedral data. On the other hand, the proposed method can directly process the block structured AMR data, leading to a significant reduction of the data size compared with the unstructured hexahedral data. In the block structured AMR grid, let the number of grids be (N + 1)³ × L, the number of variables P, and the data type float (4 bytes). The size of the dataset consisting of the physical values and the coordinates of the leaves is then ((N + 1)³ × L × P + L × 3) × 4 bytes. On the other hand, the hexahedral unstructured data consists of the physical values, the coordinates of the grids, and the connectivity of the elements, leading to a total data size of ((N + 1)³ × L × P + (N + 1)³ × L × 3 + N³ × L × 8) × 4 bytes. For example, if N = 4 and P = 4, the memory consumption of the unstructured hexahedral data is ∼ 2.8 × larger than that of the proposed method.
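With N = 4 and P = 4, the two estimates give, per leaf and in 4-byte words, 5³ × 4 + 3 = 503 for the block structured AMR format and 5³ × 4 + 5³ × 3 + 4³ × 8 = 500 + 375 + 512 = 1387 for the unstructured hexahedral format, so the ratio is 1387/503 ≈ 2.8.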

4.2 Extension for GPU supercomputer

CFD simulations optimized on GPU supercomputers tend to have low CPU utilization rates because most stencil computation is performed on the GPU. In extending IS-PBVR to GPU supercomputers, we use surplus CPU resources to avoid interference with the simulation on the GPU. Algorithm 1 shows the pseudocode of a time step loop in a CFD simulation coupled with IS-PBVR functions. In the time step loop, the stencil calculation and the center-vertex conversion are implemented in GPU kernels. As the GPU kernels are implemented using unified memory, referring to the data on the CPU at the output step triggers implicit data transfer from the GPU memory to the CPU memory. In situ visualization can be easily implemented by replacing the output process of the simulation data with the API provided by the particle generation function, ParticleGeneration(grid, value, geom), where the structure data “value” is generated by processing the center-vertex conversion on the GPU supercomputer.
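The coupling described in Algorithm 1 can be pictured as follows. This is an illustrative C++ sketch, not the actual CityLBM code; the kernel wrapper names are assumptions, and the types are placeholders as in the sketch of Sect. 3.5.

```cpp
// Illustrative time step loop of a GPU CFD code coupled with IS-PBVR
// (cf. Algorithm 1). Type and function names are placeholders.
struct Grid  { /* leaf positions, grid levels */ };
struct Value { /* simulation variables in unified memory */ };
struct Geom  { /* number of elements, coordinate min/max */ };

void stencilUpdateOnGPU(Value& value);      // LBM stencil kernels (GPU)
void centerToVertexOnGPU(Value& value);     // cell-center to vertex conversion (GPU)
void ParticleGeneration(const Grid&, const Value&, const Geom&);  // IS-PBVR (CPU)

void timeStepLoop(int nsteps, int visInterval) {
  Grid grid; Value value; Geom geom;        // "value" allocated in unified memory
  for (int step = 0; step < nsteps; ++step) {
    stencilUpdateOnGPU(value);              // simulation stays on the GPU
    if (step % visInterval == 0) {
      centerToVertexOnGPU(value);           // prepare (N+1)^3 vertex values per leaf
      // Reading "value" on the host below triggers the implicit device-to-host
      // transfer of unified memory; particle generation and monitoring-point
      // sampling then run on the otherwise lightly used CPU cores.
      ParticleGeneration(grid, value, geom);
    }
  }
}
```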

5 Development of in situ steering framework

5.1 In situ analysis of time series data

While 3D visualization and various scientific visualization techniques help qualitative and intuitive understanding of the global data structure, quantitative data analysis is also necessary for steering the simulation parameters. To this end, in addition to scientific visualization via PBVR, the proposed framework provides quantitative data analysis functions using 2D graphs. The proposed interface consists of a spatial domain view that displays the 3D visualization results and a time domain view that displays time series data in 2D graphs (see Fig. 2). The proposed framework inherits the spatial domain view from the original IS-PBVR and visualizes multi-variate data using multi-dimensional transfer functions. On the time domain view, the time series data at the monitoring points defined by users are displayed in 2D graphs, where their moving averages are also plotted. In adjusting the steering parameters, users can make quantitative comparisons between the monitoring data and reference data, such as experimental values to be compared with the simulation or target values for optimizing parameters.

Fig. 2

Interface of the proposed in situ steering framework. The red frame is the steering interface, where the steering parameters are input as text. The blue frame is the spatial domain view that displays the 3D visualization results. The green frame is the time domain view, which displays the time series data at monitoring points as a matrix of 2D graphs. The pink points on the spatial domain view are the monitoring points shown also in Fig. 6

The time domain view is controlled by the structure data defined in Table 1, in which users can specify the number and locations of the monitoring points, the layout of the matrix of 2D graphs, the range and scale (linear or logarithmic) of each axis, the reference value at each monitoring point, and the width of the moving time average. Here, the monitoring parameters, which are given in a monitoring parameter file, are loaded by Client, and the information of the monitoring points is transferred from Client to Daemon and then stored as a control file on the storage of the supercomputer.

Table 1 Attributes of monitoring parameters
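A structure corresponding to these attributes can be pictured as follows; the field names are illustrative placeholders rather than the actual layout of Table 1.

```cpp
#include <string>
#include <vector>

// Illustrative layout of the monitoring parameters controlling the time domain
// view (cf. Table 1). Field names are placeholders, not the actual attributes.
struct MonitoringPoint {
  float x, y, z;          // location of the monitoring point
  float reference;        // reference value (e.g., experimental or target value)
  std::string label;      // name shown on the 2D graph
};

struct MonitoringParameters {
  std::vector<MonitoringPoint> points;  // number and locations of monitoring points
  int   rows, cols;                     // layout of the matrix of 2D graphs
  float ymin, ymax;                     // range of the vertical axis
  bool  logScale;                       // linear or logarithmic axis
  int   movingAverageWidth;             // width of the moving time average
};
```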

As the monitoring data is sampled from the same volume data as the particle data, sampling of the monitoring data is implemented by extending the particle generation function, ParticleGeneration. In addition to the original particle generation process, the extended function loads the control file, samples the physical values at the monitoring points, and outputs additional monitoring data, which are defined in Table 2. Daemon processes the distributed monitoring data files in the same manner as the particle data files and transfers the monitoring data to Client via socket communication.

Table 2 Attributes of monitoring data sent from Server

5.2 In situ steering function

In the in situ steering framework, the file-based control of IS-PBVR is extended to include the simulation parameters used for steering the simulation. The simulation parameters are transferred from Client to Daemon together with the information of the monitoring points, and they are stored in the same control file. Figure 1, bottom, shows the data flows of the monitoring data, the particle data, the simulation parameters (and the information of the monitoring points), and the visualization parameters in the in situ steering framework.

Sampler outputs the monitoring data and the particle data as distributed files, which are collected by Daemon. Here, Sampler aggregates the monitoring data and the particle data among MPI processes on the same computation node, and generates two files per node to reduce the number of output files. The file name of each output file contains the information of the node rank, the total number of nodes, and the time step. Daemon checks the node ranks of the output files at each time step, and when all node ranks are confirmed, the distributed files are collected, and the monitoring data and the particle data are transferred to Client.
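A minimal sketch of the Daemon-side completeness check is shown below. The file naming scheme is an assumption made for illustration; only the encoded information (node rank, number of nodes, and time step) follows the description above.

```cpp
#include <filesystem>
#include <string>

// Daemon-side check: the particle/monitoring files for a given time step are
// collected only when every node rank has written its file. The naming scheme
// "prefix_<rank>_<nnodes>_<step>.dat" is an assumption for illustration.
bool allNodeFilesPresent(const std::filesystem::path& dir, const std::string& prefix,
                         int nnodes, int step) {
  for (int rank = 0; rank < nnodes; ++rank) {
    const std::string name = prefix + "_" + std::to_string(rank) + "_" +
                             std::to_string(nnodes) + "_" + std::to_string(step) + ".dat";
    if (!std::filesystem::exists(dir / name)) return false;  // still being written
  }
  return true;  // all node ranks confirmed: collect and transfer to Client
}
```

Only when this check succeeds for a time step does Daemon read, concatenate, and forward the distributed files, so partially written outputs are never transferred.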

The library of the in situ steering framework provides the steering function, Steering(). This function is called in the time step loop, checks for updates of the control file, and loads the updated simulation parameters into the structure data “Parameters”, which is shared with the simulation. The data format of “Parameters” is summarized in Table 3. Once the structure data is updated, the new parameters are reflected in the simulation from the next time step.

Table 3 Attributes of the structure data “Parameters” for steering
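The structure data and its use in the time step loop can be sketched as follows; the field names are based on the example parameters in Table 4, and the assumed Steering() signature (returning whether the control file was updated) is an illustration rather than the actual library interface.

```cpp
#include <array>
#include <vector>

// Illustrative layout of the structure data "Parameters" shared between the
// steering function and the simulation (cf. Tables 3 and 4).
struct Parameters {
  std::array<float, 3> sourcePosition;            // pollutant source (x, y, z)
  float sourceConcentration;                      // pollutant source concentration c
  std::vector<std::array<float, 3>> windSpeed;    // (u0, v0, w0), ..., (un-1, vn-1, wn-1)
};

// Assumed signature: checks whether the control file has been updated by
// Daemon and, if so, reloads the simulation parameters.
bool Steering(Parameters& params);

void simulationLoop(int nsteps, Parameters& params) {
  for (int step = 0; step < nsteps; ++step) {
    if (Steering(params)) {
      // The updated parameters take effect from the next time step.
    }
    // ... advance the simulation using params ...
  }
}
```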

In Client, the simulation parameters are input via two methods: a text-based GUI input method, which supports a few parameters (Fig. 2, red frame), and a file-based input method for large input data such as 3D datasets of boundary conditions. An example of the simulation parameters is shown in Table 4.

Table 4 An example of simulation parameters; pollutant source position (x, y, z), pollutant source concentration c, wind speed vectors based on meteorological data (u0, v0, w0), …, (un−1, vn−1, wn−1)

6 CityLBM

In CityLBM, the lattice Boltzmann method (LBM), which is a numerical scheme for approximating the Navier–Stokes equations, has been dramatically accelerated on GPUs, achieving real-time plume dispersion analysis. LBM solves the Boltzmann equation with discrete velocities on a Cartesian grid. As its computation is characterized by stencil calculations with contiguous memory access, LBM shows high thread parallel performance on GPUs. CityLBM also supports multi-scale CFD simulations involving turbulent and laminar flows at significantly different scales, where the block structured AMR grid enables high-resolution analysis while suppressing the memory usage. In this work, we address two application examples, the debris air cooling analysis and the plume dispersion analysis.
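In its simplest single-relaxation-time form, the discrete-velocity update solved at each grid point can be written as fi(x + ci∆t, t + ∆t) = fi(x, t) − (1/τ)[fi(x, t) − fi^eq(x, t)], where fi are the distribution functions for the discrete velocities ci, τ is the relaxation time, and fi^eq is the local equilibrium distribution; this streaming–collision structure is what reduces LBM to a stencil computation with regular memory access. The actual collision model used in CityLBM may be more elaborate; this form is shown only to illustrate the computational pattern.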

In the debris air cooling analysis, a free convective heat transfer experiment emulating debris air cooling in a nuclear reactor was analyzed (Onodera et al. 2020). In this analysis, the temperature distribution formed by complex thermal convection was reproduced by using the block structured AMR grid. Here, high-resolution grids are used near the wall where the temperature and velocity gradients are high, and low-resolution grids are allocated in the interior region where the gradients are moderate, thereby reducing the memory usage. Figures 3 and 4 show the block structured AMR grid and the visualization results.

Another application of CityLBM is the plume dispersion analysis in urban districts. Since the atmospheric boundary layer is characterized by fine scale turbulent flows near the ground and larger scale laminar flows at high altitudes, the block structured AMR grid was arranged depending on the altitude. This enabled detailed local wind analysis reflecting global meteorological data in a several km square computational domain with meter-scale resolution near the ground. The plume dispersion is modeled as transport of the passive scalar particles, and the pollutant concentration is calculated using the mass conservation law. In this study, we conduct numerical experiments based on the plume dispersion experiment in Oklahoma City (Onodera et al. 2021) and examine the utility of the in situ steering framework in an inverse problem to find a pollutant source from the observation data. A computational model is built based on the digital surface model (DSM) and the digital elevation model (DEM), and the plant canopy model for parks is taken into account (Fig. 6). The boundary conditions of wind speed and temperature are given based on the meteorological data.

7 Experiments and results

Our numerical experiments are performed on the GPU supercomputer HPE SGI8600, which is connected to the user PC1 and the user PC2 with a network bandwidth of ∼ 3.0 MB/s (see Table 5).

Table 5 Specification of a single node of HPE SGI8600, the user PC1 and the user PC2

7.1 Performance comparison between ParaView Catalyst and IS-PBVR

We compare the performance of in situ visualization for the debris air cooling analysis between ParaView Catalyst (ParaView) and IS-PBVR. The block structured AMR grid system consists of 24,336 leaves for Lv.1 and 499,904 leaves for Lv.2, and each leaf is given by 4³ Cartesian grids, leading to 33,551,360 total grids (see Fig. 3, left). The numerical experiments are conducted using two computation nodes with 8 GPUs on the SGI8600 and the user PC1 (see Table 5). The volume rendering images are generated using the same transfer function and color map. The APIs of IS-PBVR and ParaView were each implemented in the simulation. ParaView has two in situ visualization modes. One is the client–server rendering mode, which enables interactive in situ visualization but requires large visualization data transfer between server and client. The other is the off-screen rendering mode, in which visualization parameters are prescribed but only small image data is transferred to the client. In this benchmark, we choose the latter mode to minimize the cost of data transfer. For the volume rendering option, ray tracing is employed instead of projected tetrahedra (the default option in ParaView), because the latter option caused rendering artifacts. In IS-PBVR, the particle data size, particle generation time, particle transfer time, and image production time were measured for repeat levels of 9, 16, and 25. Figure 3 (right) and Fig. 4 show the visualization results of ParaView and IS-PBVR, respectively. These figures show that IS-PBVR produces images comparable to ParaView, and that IS-PBVR produces higher quality images as the repeat level increases. Table 6 shows the performance measurement results. The overall processing costs of IS-PBVR were 44.4 s, 77.9 s, and 122.7 s at repeat levels of 9, 16, and 25, respectively. These are about 8.7 ×, 5.0 ×, and 3.2 × faster than that of ParaView, 388 s. IS-PBVR thus has a trade-off between performance and image quality, which can be selected based on the purpose of visualization and the network bandwidth. In changing camera positions, IS-PBVR does not require particle generation and particle data transfer, while ParaView requires all the visualization processes. The performance of interactive viewpoint changes in IS-PBVR is therefore more than four orders of magnitude faster than that of ParaView.

Fig. 3

Air flows and the temperature in the debris air cooling analysis. Left is a cross section of the block structured AMR grid (plotted with 4 × smaller grid size). Right shows the volume rendering result using ParaView Catalyst

Fig. 4

The visualization results of the debris air cooling analysis for the repeat levels of 9, 16, and 25 from left to right

Table 6 The computational costs of IS-PBVR with three repeat levels and ParaView Catalyst

7.2 Application of in situ steering to inverse problem

In order to find a pollutant source from the observation data at the monitoring points, one conventionally needs to address huge parameter scans or complex inverse problems based, e.g., on data assimilation or machine learning approaches. The former require enormous computing resources, while the latter may not always be successful, depending on the nonlinearity of the problem and the quality of the observation data. On the other hand, with interactive in situ steering, an approximate optimal solution may be obtained simply using the human-in-the-loop approach, where intuitive feedback to the simulation is given by observing and understanding the response of the simulation to parameter changes.

In the numerical experiment, we consider a 4 km square computational domain, which is resolved using the block structured AMR grid with 4 m resolution near the ground, and compute the airflow and the plume dispersion from a single pollutant source. As shown in Table 7, three levels of AMR grids are arranged depending on the altitude, and the total number of grids becomes about 1.3 × 10⁸. The numerical experiment is computed using 8 computation nodes with 32 GPUs on the SGI8600 and the user PC2 (see Table 5). Under the above simulation conditions, CityLBM runs 2 × faster than real time.

Table 7 The block structured AMR grid used in the plume dispersion analysis

The center-vertex conversion is processed on the GPU, and then, the particle generation and the data sampling at the monitoring points are computed on the CPU. The cost of the center-vertex conversion and data transfer from the GPU memory to the CPU memory is 0.05 s/step.

In the particle generation, we compute visualization particle data for volume rendering of the plume distribution, a cross section of the wind speed distribution, and volume rendering of vortex structures represented by the second invariant of the velocity gradient tensor [Q-criterion (Hunt et al. 1988)]. The transfer function of the plume distribution is defined using the concentration in logarithmic scale. The cross section is generated by defining the color transfer function by the wind speed and by giving a peak of the opacity transfer function at y = 700 m. In defining the transfer function for the Q-criterion, we synthesize the Q-criterion from the velocity vector (u, v, w) by an algebraic expression including differential operators. Finally, these transfer functions are synthesized to design a multi-dimensional transfer function for multi-variate visualization. Table 8 shows the particle data size and the costs of particle generation and data transfer for each visualization object. Figure 5 shows the processing time chart for CityLBM and IS-PBVR. The center-vertex conversion and particle generation take ∼ 0.75 s/step. Daemon requires ∼ 0.06 s/step for collecting the particle data and the monitoring data, and ∼ 5.6 s/step for transferring them. The overall cost of the in situ visualization is ∼ 6.41 s/step, in which the cost of ∼ 0.75 s/step for Sampler cannot be overlapped with the simulation, while the cost of ∼ 5.66 s/step for Daemon can be masked behind the simulation and the center-vertex conversion. In order to hide the latter cost, the sampling rate of CityLBM is chosen such that the in situ visualization is processed every 100 time steps, where a single time step with the time step width of ∆t = 0.12 s (in simulation time) is processed in ∼ 0.06 s, and thus, the particle data and the monitoring data are generated every 12 s of simulation time. In the current setting, the overhead of IS-PBVR, ∼ 0.75 s/step, is estimated as ∼ 12.5% of the simulation cost of ∼ 6 s (100 steps × ∼ 0.06 s/step). Since the cost of data transfer varies depending on the network condition, it is important to control the amount of data transfer. PBVR can control the number of particles via the image quality specified by the visualization parameters. The number of particles can also be controlled by selecting the visualization objects defined in the multi-dimensional transfer function. If the cost of data transfer exceeds the interval of in situ visualization, it is possible to avoid extra overhead by interactively controlling the visualization parameters.
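Here the Q-criterion is the second invariant of the velocity gradient tensor: for incompressible flow, Q = (‖Ω‖² − ‖S‖²)/2, where S and Ω are the symmetric (strain-rate) and antisymmetric (rotation) parts of the velocity gradient, so that regions with Q > 0, where rotation dominates strain, are rendered as vortex tubes.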

Table 8 The costs of particle generation and data transfer for each visualization object
Fig. 5

Processing time chart of IS-PBVR. The top shows CityLBM coupled with Sampler on the computation nodes, and the bottom shows Daemon on the interactive node

In analyzing the inverse problem, users manipulate the position of the pollutant source based on the observation of volume rendering images and time series data sampled at the monitoring points, which are located at a height of z = 2 m on the streets in the central district of Oklahoma City (see Fig. 6). In the current problem, a steady southwest wind is imposed as the boundary condition, and a single steady pollutant source is set at a height of z = 2 m. Therefore, the wind conditions and the distribution of the plume concentration are expected to reach quasi-steady states after each change of the position of the pollutant source.

Fig. 6

Left is a top view of the central district of Oklahoma City, USA (taken from Google Maps). Right shows polygon data generated from DSM and DEM data. The horizontal (x) and vertical (y) axes correspond to the east and north directions, respectively. Pink dots A–I are the monitoring points. Yellow arrows indicate a southwest wind imposed as the boundary condition

The plume concentrations in a quasi-steady state are observed at the monitoring points A–I and are plotted on the time domain view, where the black and magenta curves show the monitoring data and its moving time average over ten observation steps, and the blue curves and light blue bands show the target values (the observed values in the target simulation) and their factor 2 ranges.

Figure 7 shows the GUI before the pollutant emission. Since the plume concentrations at the monitoring points A, B, D, and G were zero in the target simulation, blue lines are not shown for these points. Hereafter, the values observed in the simulation for analyzing the inverse problem are referred to as “monitored values”, and those in the target simulation as “observed values.”

Fig. 7

GUI of IS-PBVR before the plume emission. On the spatial domain view, a cross section of the distribution of wind speed is displayed at y = 700 m. On the time domain view, the graphs corresponding to the monitoring points A–I are displayed, where the time series data of the plume concentrations for the past 64 observation steps are plotted on a logarithmic scale

The exploration of the pollutant source is started from (x, y, z) = (−175, 70, 2), considering the southwest wind imposed at the boundary. Figure 8 shows the GUI after starting the pollutant emission. From the time domain view, it can be seen that all monitored values are out of the factor 2 range. In Fig. 9, the wind condition, the Q-criterion, and the plume concentrations are visualized. Vortex tubes are generated in various orientations around the buildings. This indicates that the wind conditions are highly turbulent near the ground, and the plume is not necessarily transported along the southwest wind.

Fig. 8

Plume distribution and monitored values for the pollutant source (x, y, z) = ( − 175, 70, 2)

Fig. 9

Visualization of vortex tubes by the Q-criterion is added to Fig. 8

Next, the pollutant source is shifted to the southeast, (x, y, z) = (−30, −130, 2), reflecting that the monitored values at D and G are overestimated, while very low concentrations are observed at E, H, and I. In Fig. 10, the overestimate at D and G is resolved, but the monitored values at E and H are still too low, because most of the plume is transported toward the northeast.

Fig. 10

Plume distribution and monitored values for the pollutant source (x, y, z) = ( − 30, − 130, 2)

To increase the monitored values at E, H, and I, the pollutant source is moved northward to (x, y, z) = (−30, 30, 2). In Fig. 11, the plume dispersion shows complicated transport along the streets, which increases the monitored values at E, H, and I. The monitored values and their moving time averages converge within the factor 2 range, and thus, the pollutant source in the target simulation is identified.

Fig. 11

Plume distribution and monitored values for the pollutant source (x, y, z) = (− 30, 30, 2)

The results in Fig. 11 show an interesting feature. Although the plume dispersion develops in the north direction along the central street, high concentrations are observed only at E and H. While the monitoring point B is located on the same street, it shows a low concentration. To understand this feature, the Q-criterion and the plume concentrations are visualized in Fig. 12. The visualization result shows that behind the tall buildings, longitudinal vortices are generated near the monitoring points E and H, and the plume is transported vertically along these vortices. As a result, the plume is transported at higher altitudes and does not reach the monitoring point B near the ground.

Fig. 12

Visualization of the Q-criterion and the plume concentrations for Fig. 11

8 Conclusion

In this study, we extended the in situ visualization framework IS-PBVR to an in situ steering framework, which visualizes and controls large-scale CFD simulations on GPU supercomputers at runtime. The file-based control enables interactive steering at runtime without interrupting the simulation. The in situ visualization on the GPU supercomputer is designed to be processed on the host CPU without interfering with the simulation on the GPU and is extended to block structured AMR data.

Performance comparisons between IS-PBVR and ParaView Catalyst were conducted for the debris air cooling analysis. IS-PBVR can control the image quality and the performance by varying the repeat level, and the overall visualization with repeat levels of 9, 16, and 25 was, respectively, 8.7 ×, 5.0 ×, and 3.2 × faster than ParaView Catalyst. IS-PBVR was also more than four orders of magnitude faster than ParaView Catalyst for viewpoint changes, which are required for interactive visualization.

In the numerical experiment, we applied IS-PBVR to the plume dispersion analysis and addressed an inverse problem to find the pollutant source from the observation data at the monitoring points. The pollutant source was explored while analyzing the time evolution of the plume dispersion using the spatial and time domain views. On the spatial domain view, the complex turbulent structure was clearly shown by designing the multi-dimensional transfer function, which can be changed interactively. These highly flexible visualization functions enable us to steer the simulation parameters based on an intuitive understanding of extreme scale volume data. The above numerical experiment was realized by interactive in situ steering in a single batch job. If one conducted the same analysis with conventional polygon-based visualization techniques, it would take an enormous amount of time to transfer the polygon data. Therefore, the proposed framework is particularly effective in debugging, exploring optimal solutions, and analyzing inverse problems through the human-in-the-loop approach.

In future work, the proposed framework will be extended to include statistical analysis and machine learning techniques, which are expected to expand the applicability of in situ steering to a wider range of simulations.