Correlation-aware probabilistic data summarization for large-scale multi-block scientific data visualization

In this paper, we propose a correlation-aware probabilistic data summarization technique to efficiently analyze and visualize large-scale multi-block volume data generated by massively parallel scientific simulations. The core of our technique is correlation modeling of distribution representations of adjacent data blocks using copula functions and accurate data value estimation by combining numerical information, spatial location, and correlation distribution using Bayes’ rule. This effectively preserves statistical properties without merging data blocks in different parallel computing nodes and repartitioning them, thus significantly reducing the computational cost. Furthermore, this enables reconstruction of the original data more accurately than existing methods. We demonstrate the effectiveness of our technique using six datasets, with the largest having one billion grid points. The experimental results show that our approach reduces the data storage cost by approximately one order of magnitude compared to state-of-the-art methods while providing a higher reconstruction accuracy at a lower computational cost.

Scientists increasingly run massively parallel numerical simulations at very high resolution, which is accompanied by the generation of large-scale scientific data. Then, aided by scientific visualization, they convert the data into intuitive visualizations to study various problems [1]. However, the storage bottleneck of supercomputers makes it challenging to analyze and visualize such data efficiently at the original resolution [2]. For this reason, data reduction is a viable option.
Probabilistic data summarization is one such data reduction approach. After dividing the original single-block data into a series of data blocks according to the homogeneity requirement of visualization, it adopts compact and universal distribution representations (e.g., histograms and Gaussian mixture models (GMMs)) in place of the traditional grid representation of each data block [9][10][11]. Probabilistic data summarization significantly reduces the volume of data and permits efficient post-hoc visualization [12][13][14][15][16][17].
Meanwhile, with the development of massively parallel computing, scientists often apply domain decomposition strategies [18], as depicted in Fig. 1 [19], to efficiently obtain high-precision solutions to complex real-world problems. This results in the generation of large-scale multi-block data. However, directly applying existing probabilistic data summarization methods to such data raises two issues. First, the domain decomposition strategy neglects the inherent spatial consistency of the original data, which results in high data value variation within each block. Statistical modeling of such a data block inevitably introduces artifacts and discontinuities at block boundaries during visualization. Existing methods therefore typically first merge the data blocks held by different parallel computing nodes and then repartition them according to the homogeneity requirements, incurring a high computational cost. Second, these methods do not model the correlations between different data blocks and thus cannot yield high-quality reconstruction results. Figure 2 shows an example where the reconstruction results of existing methods suffer from limited accuracy, especially significant numerical discontinuities at block boundaries. Developing a probabilistic data summarization method that offers a good trade-off between low computational cost and high reconstruction accuracy remains a challenge.
In this paper, we propose a novel correlation-aware probabilistic data summarization method to overcome the above challenges. Since the copula function [21] offers a statistically robust way to model correlations between variables, it has been used for uncertainty modeling in ensemble datasets [22] and distribution modeling in multivariate data [23]. We therefore use copula functions to represent the statistical correlations between different data blocks, which is essential for efficient, high-precision visual analysis of large-scale multi-block scientific data.
Our method consists of two stages: correlation-aware modeling and structural grid data reconstruction. In the first stage, we statistically model the numerical and spatial correlations between adjacent data blocks after calculating the numerical and spatial distributions of each data block [20]. Specifically, we use a histogram to represent the numerical distribution of each data block and a GMM to represent the spatial distribution of the grid samples in each histogram bin. Then, we use the Gaussian copula function to model the correlations between the numerical distributions of adjacent blocks and between the spatial distributions of adjacent bins. In the second stage, we use a copula-based structural grid data reconstruction method which couples the numerical information, spatial location, and correlation distribution by Bayes' rule to estimate the reconstructed value at a given spatial location more accurately, particularly at block boundaries, as shown in Fig. 2(d).
The major contribution of our work is a correlation-aware statistical model that solves the problems of high computational cost and low reconstruction accuracy faced by existing methods when dealing with large-scale multi-block scientific data. We demonstrate the effectiveness of our technique using six datasets, the largest of which contains one billion grid points. Our experimental evaluation considers reconstruction accuracy and computational cost. Compared to state-of-the-art methods, our method avoids the computational cost of data merging and repartitioning, reduces the data storage cost by approximately one order of magnitude, and achieves higher reconstruction accuracy.

Compression
Data compression is a common method for data reduction; both lossy and lossless methods exist. Lossless compression is unsuitable for handling scientific data with floating-point characteristics. For example, improving the data compression ratio by an order of magnitude is difficult when using lossless compression methods such as run-length encoding [24,25] and bZIP [26]. Lossy compression is a relatively effective approach to reduce the data storage cost. For example, the geometry-driven lossy compression method [27] involves three main processing stages: grid vertex position quantization, prediction, and entropy coding. Progressive grid compression has gradually become a focus of research for processing the ever-increasing amount of data; progressive octree coding [28], wavelet coding [29], geometry image coding [30], and related approaches have also emerged. However, lossy compression cannot retain the features of the original data with high accuracy while providing a high data compression ratio.

Fig. 2 Our correlation-aware probabilistic summarization method significantly reduces the data storage cost while ensuring high reconstruction accuracy. A multi-block volume rendering of 512 × 512 × 512 aircraft electromagnetic data is shown. Compared to the ground truth (a), the reconstruction result of our method (d) has higher accuracy than the reconstruction results of Refs. [9,20].

Feature extraction
Feature extraction is a data reduction process achieved by identifying and extracting key features of the data. Feature descriptors include isosurfaces, streamlines, streaklines, vector field topology, vortex tubes, cracks, and fault lines. For volume data, isovalues and transfer functions are typically used to extract data features. Tzeng et al. [31] used scalar values, gradient values, and spatial position coordinates to train transfer functions for data feature recognition. Kindlmann et al. [32] used surface curvature to perform feature classification of data samples. Tenginakai et al. [33] defined isosurfaces using neighborhood statistics. Hladůvka et al. [34] and Kniss et al. [35] achieved feature separation using isosurfaces. However, these feature extraction methods all have weak generality because of their high dependence on a problem-specific feature definition.

Probabilistic data summarization
Probabilistic data summarization uses compact distribution representations instead of traditional grid representations, which can significantly reduce the data storage cost. This approach retains data features well, albeit with some visualization uncertainty. Thompson et al. [9] used a histogram to approximate the isosurface of structural grid data. Wei et al. [16] proposed an effective histogram-based algorithm to search for similar feature distributions in local regions of the original data. Liu et al. [10] and Dutta and Shen [11] used GMMs for the compact expression of data features. However, these methods only consider the numerical distribution of the original data and ignore its spatial distribution, which results in low reconstruction accuracy and high visualization uncertainty. To address this problem, Wang et al. [20] proposed the spatial GMM, which uses a histogram to model the numerical distribution and a GMM to model the spatial distribution. By coupling these two types of distribution models using Bayes' rule, the spatial GMM improves reconstruction accuracy. However, because the domain decomposition strategies in massively parallel scientific simulations do not consider the inherent spatial consistency of the original data, the above summarization methods cannot be applied directly to large-scale multi-block data.

Correlation-based modeling
Existing probabilistic data summarization methods have low reconstruction accuracy for multi-block data and lose essential statistical properties, particularly at block boundaries. Correlation-based modeling improves reconstruction accuracy by considering data correlations. Dutta et al. [18] proposed a correlation clustering method based on the inherent spatial consistency of the original data, but it is unsuitable for multi-block structural grid data generated by parallel scientific simulations. Wang et al. [36] used prior knowledge capturing correlations between low-resolution and high-resolution data to improve reconstruction accuracy; however, calculating this prior knowledge is highly time-consuming. Hazarika et al. [22,23] conducted multivariate correlation analysis to improve reconstruction accuracy and reduce visualization uncertainty. Unlike the above methods, we propose a statistical modeling method suited to multi-block data, which makes probabilistic data summarization practical for large-scale scientific simulations. High-precision probabilistic data summarization adapted to the domain decomposition strategy of parallel scientific simulation is thus of great significance and deserves further study.

Correlation modeling
A correlation model is a tool used to describe the dependency of several random variables in statistics. In this study, the copula function [21,37] is used to model numerical and spatial correlations between adjacent data blocks, so as to improve the reconstruction accuracy of data summarization.
An n-dimensional copula is a cumulative distribution function (CDF) with uniform marginal CDFs. For n uniform random variables U_1, ..., U_n, the copula is denoted C(u_1, ..., u_n), where u_i is a realization of U_i. Sklar's theorem states that a joint CDF F of n random variables (X_1, ..., X_n) may be decomposed into their respective marginal CDFs (F_1, ..., F_n) and a copula function C:

$$F(x_1, \dots, x_n) = C\bigl(F_1(x_1), \dots, F_n(x_n)\bigr) \qquad (1)$$

Let f, f_i, and c denote the probability density functions (PDFs) of F, F_i, and C, respectively. Using the relationship between a CDF and its PDF, Eq. (1) may be rewritten as

$$f(x_1, \dots, x_n) = c\bigl(F_1(x_1), \dots, F_n(x_n)\bigr) \prod_{i=1}^{n} f_i(x_i) \qquad (2)$$

The copula C linking the marginals F_i to F can take many forms; the most common is the Gaussian copula function, which effectively reduces the data storage cost and further permits efficient computation.
If F is an n-dimensional standard normal distribution, then the corresponding C in Eq. (1) is a Gaussian copula function:

$$C(u_1, \dots, u_n; \Theta) = \Phi_\Theta\bigl(\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_n)\bigr) \qquad (3)$$

where Φ is the standard normal distribution function, Φ_Θ is the CDF of a normal distribution with zero mean vector and correlation matrix Θ, and W = [w_1, ..., w_n]^T with w_i = Φ^{-1}(u_i) is the integration variable vector. The corresponding c in Eq. (2) is

$$c(u_1, \dots, u_n; \Theta) = \frac{\partial^n C(u_1, \dots, u_n; \Theta)}{\partial u_1 \cdots \partial u_n} = \frac{1}{\sqrt{|\Theta|}} \exp\!\Bigl(-\tfrac{1}{2} W^{\mathrm{T}} (\Theta^{-1} - I)\, W\Bigr) \qquad (4)$$

Figure 3 shows a simple example of a two-dimensional Gaussian copula function. First, we generate samples X_1 and X_2 from a standard bivariate normal distribution with

$$\Theta = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}$$

(see Fig. 3(a)). Because the outputs of the CDFs of X_1 and X_2 follow uniform distributions, we transform the samples into a bivariate uniform distribution (see Fig. 3(b)). Finally, we transform the uniform variables U_1 and U_2 into the desired distribution types using the inverse CDFs of X_1 and X_2 (see Fig. 3(c)).
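To make the Fig. 3 workflow concrete, the following minimal Python sketch (an illustration, not part of the paper's implementation) samples a bivariate Gaussian copula and maps it onto two hypothetical exponential marginals; NumPy and SciPy are assumed.

```python
# A bivariate Gaussian copula, following the three steps of Fig. 3.
# The exponential marginals are purely illustrative assumptions.
import numpy as np
from scipy.stats import norm, expon

rng = np.random.default_rng(0)
theta = np.array([[1.0, 0.5],
                  [0.5, 1.0]])              # correlation matrix Theta

# (a) sample a standard bivariate normal with correlation matrix Theta
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=theta, size=5000)

# (b) the standard normal CDF turns each coordinate into a uniform variable,
#     giving samples (u1, u2) of the copula itself
u = norm.cdf(z)

# (c) inverse CDFs of the desired marginals map the uniforms to target values
x1 = expon(scale=1.0).ppf(u[:, 0])          # hypothetical marginal for X1
x2 = expon(scale=2.0).ppf(u[:, 1])          # hypothetical marginal for X2

print(np.corrcoef(x1, x2)[0, 1])            # dependence survives the transforms
```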

Revisiting the spatial GMM
Spatial GMMs [20,38] compactly capture the spatial coherence between similar data values. A spatial GMM maps spatial locations to probabilities, unlike the block GMM [10,17], which maps data values to probabilities. For a spatial location p, the corresponding spatial GMM is

$$\mathrm{SGmm}(p) = \sum_{k=1}^{K} \omega_k\, \mathcal{N}(p; \mu_k, \Sigma_k) \qquad (5)$$

where K is the number of Gaussian components, ω_k, μ_k, and Σ_k are the weight, mean, and covariance matrix of the kth Gaussian component, respectively, and Σ_{k=1}^{K} ω_k = 1. A spatial GMM is a multivariate GMM whose dimensionality is determined by p. Determining a spatial GMM is equivalent to calculating its parameters (ω_k, μ_k, Σ_k). Because the Gaussian component to which a sampled location belongs is unknown, this is a parameter estimation problem with missing data, and the expectation-maximization (EM) algorithm [38] is used to solve Eq. (5).
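As an illustration of Eq. (5), the sketch below fits a spatial GMM to hypothetical 3D sample locations with scikit-learn's EM-based GaussianMixture; it is a stand-in for the paper's own solver, not its implementation.

```python
# Fitting a spatial GMM (Eq. (5)) with EM, using scikit-learn as a stand-in
# for the solver used in the paper. The sample locations are synthetic.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
locations = rng.uniform(0.0, 16.0, size=(2000, 3))   # hypothetical (x, y, z) samples

K = 4                                                 # number of Gaussian components
sgmm = GaussianMixture(n_components=K, covariance_type="full").fit(locations)

# sgmm.weights_, sgmm.means_, sgmm.covariances_ correspond to (omega_k, mu_k, Sigma_k)
print(sgmm.weights_.sum())                            # weights sum to 1
p = np.array([[8.0, 8.0, 8.0]])
print(np.exp(sgmm.score_samples(p)))                  # spatial GMM density at p
```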
To improve the reconstruction accuracy of a spatial GMM, current practice merges the data blocks held by different parallel computing nodes and repartitions them according to the homogeneity requirements of the visualization. However, merging and repartitioning have a high computational cost. As shown in Fig. 4, for data with a resolution of 1000 × 1000 × 1000, the merging and repartitioning time for a spatial GMM increases as the number of cores increases, whereas the modeling time decreases. Therefore, as the scale of massively parallel scientific simulations grows, multi-block data must be processed directly, with the initial domain decomposition unchanged, to preserve parallel performance. To achieve this goal, the proposed method uses Gaussian copula functions.

Fig. 4
Merging and repartitioning timing according to visualization requirements, using different numbers of parallel computing cores, for the original multi-block data with a resolution of 1000 × 1000 × 1000, and modeling timing for the spatial GMM.

Goals
We study probabilistic summarization for large-scale multi-block volume data. Our goal is to efficiently preserve the statistical properties of the original data with low computational and storage costs. We ensure that the original data are processed directly, with the initial domain decomposition unchanged.

Methodology
Because low computational cost, low storage cost, and high reconstruction accuracy are mutually antagonistic and strongly coupled, it is very challenging to develop a probabilistic data summarization that adequately satisfies all three requirements. One possible approach is to compute compact distribution representations for each data block with the initial domain decomposition unchanged, use copula functions to represent the statistical correlations between different data blocks, and finally combine all of these representations to achieve a high-precision reconstruction. Hazarika et al. [22,23] took the statistical distribution representations of different variables as marginal distributions for correlation modeling; however, they did not consider the correlation between different data blocks generated by massively parallel scientific simulations. In this work, we take the numerical and spatial distribution representations of different data blocks as marginal distributions for correlation modeling, which improves reconstruction accuracy, particularly at block boundaries.

Workflow
To achieve the above goals, we propose a correlation-aware probabilistic data summarization approach consisting of two stages, shown in Fig. 5:
• Stage I: Correlation-aware modeling (see Section 5) computes statistical correlations between adjacent blocks, represented by Gaussian copula functions NC and SC, with the initial domain decomposition unchanged. Built on top of the numerical distribution represented by a histogram and the spatial distribution represented by a GMM, this correlation-aware data representation significantly improves reconstruction accuracy, particularly at block boundaries. Finally, the summary is stored on disk, reducing the original data.
• Stage II: Structural grid data reconstruction for post-hoc visualization (see Section 6) uses Bayes' rule to combine numerical distributions, spatial distributions, and statistical correlations to accurately estimate the value at an arbitrary spatial location p.
The correlation-aware modeling in our method can be used directly in massively parallel scientific simulations, processing the large-scale multi-block data they output and generating probabilistic summaries with low storage cost for efficient analysis and visualization. Our probabilistic data summarization significantly reduces the computational cost and data storage cost for massively parallel scientific simulations, while maintaining the statistical properties of the original structural grid data with high reconstruction accuracy.

Fig. 5 Overview of our method. Given multi-block data, each single-block data's numerical and spatial distributions are represented by a histogram and a GMM, and further combined as a spatial GMM. Correlation-aware probabilistic summarization of the multi-block data is represented by Gaussian copula functions (Stage I). Consider the ith block (shown with a gray background); nine blocks are contained in its 1-ring, so nine marginal distributions are taken into consideration. In Stage II, copula-based reconstruction supports various post-hoc multivariate analysis and visualization tasks using the stored summarization. Given an arbitrary spatial location, the corresponding value is reconstructed using Bayes' rule.

Single-block data summarization
Probabilistic summarization of single-block data forms the basis of the proposed method. The distribution representation includes two types: numerical distribution and spatial distribution.

Numerical distribution
Consider the ith block, Block_i, as an example. The variable of Block_i is denoted X_i, and its PDF is estimated as Hist_i, a histogram consisting of M bins, where each bin represents a scalar value interval. The corresponding CDF is denoted H_i(X_i). Let Bin_i^j denote the jth bin, whose interval covers the scalar value range [L_i^j, U_i^j]. If N_i^j denotes the number of grid samples in Block_i whose values fall in Bin_i^j and N_i denotes the total number of grid samples in Block_i, then Bin_i^j has probability N_i^j / N_i.

Spatial distribution
The PDF of the grid sample locations P_i^j for the data values within Bin_i^j is estimated as SGmm_i^j, using the method in Section 3.2; S_i^j denotes the corresponding CDF.
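The sketch below combines the numerical and spatial distributions above into a single-block summary: a 128-bin histogram plus one spatial GMM per non-empty bin. The function name summarize_block and the use of scikit-learn are illustrative assumptions, not the paper's implementation.

```python
# Single-block summarization: a 128-bin histogram (Hist_i) plus one spatial
# GMM (SGmm_i^j) per non-empty bin. `summarize_block` is a hypothetical helper.
import numpy as np
from sklearn.mixture import GaussianMixture

def summarize_block(values, locations, n_bins=128, n_components=4):
    """values: (N,) scalars of Block_i; locations: (N, 3) grid coordinates."""
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    bin_ids = np.clip(np.digitize(values, edges) - 1, 0, n_bins - 1)
    counts = np.bincount(bin_ids, minlength=n_bins)
    probs = counts / counts.sum()                     # P(Bin_i^j) = N_i^j / N_i

    spatial = {}
    for j in np.nonzero(counts)[0]:                   # spatial GMM per non-empty bin
        pts = locations[bin_ids == j]
        k = min(n_components, len(pts))               # never more components than samples
        spatial[int(j)] = GaussianMixture(n_components=k).fit(pts)
    return edges, probs, spatial
```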

Multi-block correlation summarization
Based on the single-block data summarization described in Section 5.1, Gaussian copula functions are used to model the numerical correlation between different histograms and the spatial correlation between different GMMs. Because the statistical correlation is determined entirely by the Gaussian copula function and is independent of the marginal distributions, the correlation calculation is the same for both types of marginal distribution.

Numerical correlation modeling
The histogram associated with each data block compactly represents the numerical distribution of the grid samples within this block. However, the histogram cannot appropriately describe the statistical properties of scalar values at the block boundary, particularly those with large variations in data values. Therefore, the Gaussian copula function is used to capture the numerical correlations between these numerical distributions, where the histograms are treated as marginal distributions.
We now explain the formulation. For Block_i (e.g., the blue block in Fig. 6), the histograms of the blocks in its 1-ring are treated as marginal distributions, and the correlation between them is modeled using a Gaussian copula function:

$$NC_i = C\bigl(H_i(x_i), H_i^1(x_i^1), \dots, H_i^{T_n}(x_i^{T_n}); \Theta_i\bigr) = \Phi_{\Theta_i}\bigl(\Phi^{-1}(H_i(x_i)), \Phi^{-1}(H_i^1(x_i^1)), \dots, \Phi^{-1}(H_i^{T_n}(x_i^{T_n}))\bigr) \qquad (6)$$

where H_i^t (t = 1, ..., T_n) are the histogram CDFs of the 1-ring neighbors of Block_i and Θ_i is the correlation matrix. According to the number of faces N_f of Block_i which intersect the original data boundary, four scenarios are possible, with T_n taking the value 26, 17, 11, or 7.
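For intuition, the sketch below estimates one entry of the correlation matrix Θ_i from two adjacent blocks using a simple normal-scores approach. Pairing samples by their local index and bypassing the BFGS-accelerated EM solver described in the Solver subsection are simplifying assumptions made purely for illustration.

```python
# Estimating one off-diagonal entry of Theta_i from two adjacent blocks with a
# normal-scores estimate. Pairing samples by local index is an illustrative
# assumption; the paper instead uses a BFGS-accelerated EM solver.
import numpy as np
from scipy.stats import norm

def marginal_cdf(block_values, queries):
    """Empirical (histogram-style) marginal CDF H(x) evaluated at `queries`."""
    sorted_vals = np.sort(block_values)
    ranks = np.searchsorted(sorted_vals, queries, side="right")
    return np.clip(ranks / len(sorted_vals), 1e-6, 1.0 - 1e-6)   # keep Phi^-1 finite

def copula_correlation(block_a, block_b):
    u = marginal_cdf(block_a, block_a)      # pseudo-observations of Block_i
    v = marginal_cdf(block_b, block_b)      # pseudo-observations of a 1-ring neighbour
    z, w = norm.ppf(u), norm.ppf(v)         # normal scores
    return np.corrcoef(z, w)[0, 1]          # one entry of the correlation matrix
```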

Spatial correlation modeling
The GMM associated with each bin compactly represents the spatial distribution of the grid sample locations for the scalar values within this bin. As the scalar values of the grid samples in adjacent bins are similar, there must be spatial coherence among these spatial locations. Therefore, the Gaussian copula function is used to capture the spatial correlations between these spatial distributions, where the GMMs are treated as marginal distributions.
We now explain the formulation. For the jth bin of the ith block that is non-empty (i.e., contains grid samples), the CDFs of the spatial GMMs of the bins in its 1-ring are treated as marginal distributions, and the correlation between them is modeled using a Gaussian copula function (Fig. 7):

$$SC_i^j = \Phi_{\Theta_i^j}\bigl(\Phi^{-1}(S_i^j(p)), \Phi^{-1}(S_i^{j,1}(p^1)), \dots, \Phi^{-1}(S_i^{j,T_s}(p^{T_s}))\bigr) \qquad (7)$$

where Φ_{Θ_i^j} is the CDF of a normal distribution with zero mean vector and correlation matrix Θ_i^j, and S_i^{j,t} (t = 1, ..., T_s) is the CDF of the spatial GMM of a non-empty neighboring bin other than Bin_i^j. According to the number N_b of non-empty bins other than Bin_i^j, three scenarios are considered, with T_s taking the value 1 or 2.

Solver
Because the maximum likelihood functions of the Gaussian copula functions in Eqs. (6) and (7) are complex and cannot be solved via partial derivatives, the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method [39], the most popular quasi-Newton algorithm, is used to accelerate the EM algorithm for this parameter estimation problem with missing data.
We now consider the discrete calculation of the copula's marginal CDFs. To calculate the parameters of a Gaussian copula function, it is necessary to precompute the marginal CDFs of the value distributions and spatial distributions. The discrete calculation of H_i and H_i^t (t = 1, ..., T_n) in Eq. (6) is straightforward: probabilities of the bins satisfying the relevant condition are accumulated. The calculation of S_i^j and S_i^{j,t} (t = 1, ..., T_s) in Eq. (7) is also performed discretely. Taking S_i^j as an example, for a location p = (x, y, z) ∈ P_i^j, the value of the corresponding CDF is

$$S_i^j(p) = \frac{\bigl|\{\, p_s = (x_s, y_s, z_s) \in P_i^j : x_s \le x,\; y_s \le y,\; z_s \le z \,\}\bigr|}{|P_i^j|}$$

i.e., the fraction of sample locations p_s ∈ P_i^j with x_s ≤ x, y_s ≤ y, and z_s ≤ z. After calculating the CDFs, the parameters of the Gaussian copula functions are estimated. The EM algorithm [38] is an iterative algorithm for maximum likelihood estimation that is widely applied to incomplete data. However, the EM algorithm has a sublinear convergence speed, and the derivative of the Q function in the EM algorithm has no explicit expression, so BFGS [39] is used to accelerate it. Pseudocode for the case of calculating SC_i^j in Eq. (7) is given in Algorithm 1.
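Before turning to the optimization itself, here is a minimal sketch of the discrete spatial CDF S_i^j defined above, assuming P_i^j is given as an (N, 3) array of grid sample locations.

```python
# Discrete marginal spatial CDF S_i^j(p): the fraction of sample locations
# dominated component-wise by the query location p.
import numpy as np

def spatial_cdf(points, p):
    """points: (N, 3) array of locations P_i^j; p: query location (x, y, z)."""
    dominated = np.all(points <= np.asarray(p), axis=1)
    return dominated.sum() / len(points)

pts = np.random.default_rng(2).uniform(size=(10000, 3))
print(spatial_cdf(pts, (0.5, 0.5, 0.5)))    # roughly 1/8 for uniform points in a cube
```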
Algorithm 1 BFGS-based EM acceleration. Input: grid sample locations P_i^j, the Gaussian copula function SC_i^j, and the maximum number of iterations n_max. Initialize Θ_{i,0}^j = I and n = 0; repeat EM iterations, maximizing the lower-bound function g_n(Θ_i^j) in each M-step with BFGS, until the update falls below η_1 or n reaches n_max.

We determine the step size and stopping conditions as follows. The Q function in Algorithm 1 is the expectation of the logarithmic likelihood function of the complete data under the conditional probability distribution of the unobserved data. The step size α_m is determined using a backtracking line search so that the energy decreases monotonically. To optimize g_n(Θ_i^j), α_m in the mth (m > 1) iteration is initialized to α_{m−1}, and α_m is doubled if the initial α_m decreases the energy; α_0 = 1. For the stopping conditions, we set η = η_1 = η_2 = 10^{−6} and n_max = m_max = 500 in our experiments.
Figure 8 shows a schematic diagram of the BFGS-based EM acceleration algorithm estimating the parameter of a simple bivariate Gaussian copula function. The gray curve represents the objective function (the logarithmic likelihood of Θ_i^j), and the blue curves represent the lower-bound function g_n(Θ_i^j), which progressively approaches and converges to the local optimum.
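The following toy sketch illustrates the role of BFGS in the solver: it fits the single correlation parameter of a bivariate Gaussian copula by maximum likelihood with scipy.optimize. It optimizes the log-likelihood directly rather than wrapping BFGS around the EM Q-function as Algorithm 1 does, so it should be read as an illustration only.

```python
# Toy BFGS fit of a bivariate Gaussian copula's correlation parameter by
# maximum likelihood. Unlike Algorithm 1, the log-likelihood is optimized
# directly rather than through the EM Q-function.
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(rho, z, w):
    """Negative log-density of the bivariate Gaussian copula at normal scores (z, w)."""
    r = np.clip(rho[0], -0.999, 0.999)
    quad = (r * r * (z * z + w * w) - 2.0 * r * z * w) / (1.0 - r * r)
    return np.sum(0.5 * np.log(1.0 - r * r) + 0.5 * quad)

# synthetic normal scores with a true correlation of 0.7
rng = np.random.default_rng(3)
latent = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]], size=5000)
z, w = latent[:, 0], latent[:, 1]

result = minimize(neg_log_likelihood, x0=[0.0], args=(z, w), method="BFGS")
print(result.x)                             # should be close to 0.7
```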

Approach
Probabilistic summarization significantly reduces the data storage cost of massively parallel scientific simulations. However, existing visual analysis algorithms are designed for gridded data, so visual analysis applications must reconstruct grid data from the compact probabilistic summarizations. For structural grid data, only the scalar values need to be reconstructed, and this reconstruction relies on the probabilistic summarizations.

Joint distribution
Based on the numerical and spatial distributions of each data block (Section 5.1) and the correlation distributions of the multi-block data (Section 5.2), the joint distribution functions are obtained using Eqs. (1) and (3); f_i denotes the PDF of the joint numerical distribution of Block_i, and f_i^j denotes the PDF of the joint spatial distribution of Bin_i^j. Note that because the joint spatial distribution is an approximate representation whose domain is an infinite space rather than a single data block, a bias is introduced into the computation. For correctness, the normalized probability is therefore used in place of the original one:

$$f_i^{j*}(p) = \frac{f_i^j(p)}{\int_{\Omega_i} f_i^j(l)\, \mathrm{d}l}$$

where Ω_i is the spatial region of Block_i, and ∫_{Ω_i} f_i^j(l) dl is the accumulated probability of f_i^j over Ω_i. We denote by f_i^* the joint numerical distribution of Block_i built from these normalized spatial terms.

Reconstruction
With the normalized joint distribution, reconstruction of the structural grid data consists of three steps:
• Step 1: Given an arbitrary spatial location p = (x, y, z), locate the block containing p; let Block_i be this block.
• Step 2: Traverse all bins of f_i^*, applying Bayes' rule. The probability that the value at p belongs to Bin_i^j is

$$\mathrm{Prob}_p(\mathrm{Bin}_i^j) = \frac{f_i^{j*}(p)\, P(\mathrm{Bin}_i^j)}{\sum_{k=1}^{M} f_i^{k*}(p)\, P(\mathrm{Bin}_i^k)}$$

where P(Bin_i^j) = N_i^j / N_i is the histogram probability of Bin_i^j.
• Step 3: After traversing all bins and computing the corresponding probabilities, the value of the interval corresponding to the maximum of Prob_p is taken as the reconstructed value at p.
Using this reconstruction approach, a high-accuracy reconstruction result is obtained with better data consistency and continuity than existing methods, particularly at block boundaries. The reconstructed data have the same grid resolution as the original data.
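A minimal sketch of this per-location reconstruction follows, reusing the hypothetical summarize_block output from Section 5.1. For brevity the copula correction is omitted, so each bin's score uses only its spatial GMM density and histogram probability rather than the normalized joint density f_i^{j*}.

```python
# Per-location reconstruction with Bayes' rule, reusing the hypothetical
# summarize_block output. The copula term is omitted for brevity, so each bin's
# score is its spatial GMM density at p times its histogram probability.
import numpy as np

def reconstruct_value(p, edges, probs, spatial):
    """p: (3,) location inside the block; (edges, probs, spatial): block summary."""
    scores = np.zeros(len(probs))
    for j, gmm in spatial.items():
        density = np.exp(gmm.score_samples(np.asarray(p).reshape(1, 3)))[0]
        scores[j] = density * probs[j]                 # posterior up to a constant
    j_star = int(np.argmax(scores))                    # most probable bin
    return 0.5 * (edges[j_star] + edges[j_star + 1])   # bin-centre value estimate
```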

Outline
The experimental evaluation considered three issues: reconstruction accuracy was evaluated using different location representations, different correlation representations, and different block sizes (see Section 7.6). Computational performance was evaluated from the perspective of timing, to demonstrate the applicability and efficiency of the proposed method for massively parallel scientific simulations (see Section 7.7). A further evaluation on very large-scale data is provided in Section 7.8. A final summary and analysis is provided in Section 7.9. First we discuss the experimental setting.

Datasets
To evaluate the reconstruction accuracy and computational performance of the proposed method for large-scale multi-block scientific data visualization, six sets of scientific data with a maximum grid size of one billion cells were used, as shown in Table 1.

Experimental setup
We used a single node of Inspur's high-performance server system (Inspur 2017) with 24 cores and 512 GB of RAM, and nodes of the Sugon supercomputer, each with 24 cores and 64 GB of RAM, to run a Message Passing Interface (MPI) based parallel program. In the probabilistic summarization calculation, the number of bins in each histogram was set to 128, and the number of Gaussian components in the spatial GMM was set to 4.

Competitors
We compared our approach with four state-of-the-art methods: the block histogram model [9], the spatial GMM [20], an entropy-based model [40], and the SLIC-based model [18]. The block histogram model and the spatial GMM improve reconstruction accuracy by augmenting the representation with location information, while the entropy-based model and the SLIC-based model do so by extending the representation with correlation information.

Correlation retention
To demonstrate that the proposed correlation-aware probabilistic summarization has clear advantages for correlation retention, the Pearson correlation coefficients of adjacent blocks were calculated for the original data X and the reconstructed data Y under the same domain decomposition strategy and stored in two sequences P_X and P_Y. The value mean deviation (VMD) of P_X and P_Y was used to measure the correlation retention:

$$\mathrm{VMD} = \frac{1}{L} \sum_{l=1}^{L} \bigl| P_X^{(l)} - P_Y^{(l)} \bigr|$$

where L is the number of elements in P_X and P_Y.
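A small sketch of the VMD computation, assuming the mean-absolute-difference form given above and equal-length sequences P_X and P_Y:

```python
# VMD between the Pearson-coefficient sequences of the original and
# reconstructed data, assuming the mean-absolute-difference form above.
import numpy as np

def vmd(p_x, p_y):
    p_x, p_y = np.asarray(p_x, dtype=float), np.asarray(p_y, dtype=float)
    return float(np.mean(np.abs(p_x - p_y)))
```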

Reconstruction error
The normalized root mean squared error (RMSE) and normalized maximum error (NME) were used to evaluate the reconstruction error. They are calculated as

$$\mathrm{RMSE} = \frac{1}{X_{\mathrm{range}}} \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2}, \qquad \mathrm{NME} = \frac{\max_i |x_i - y_i|}{X_{\mathrm{range}}} \qquad (12)$$

where X denotes the original data, Y the reconstructed data, x_i and y_i the values of their ith grid samples, N the total number of grid samples in X, and X_range = max(X) − min(X).
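A minimal sketch of Eq. (12), assuming X and Y are NumPy arrays holding the original and reconstructed fields:

```python
# Normalized RMSE and NME of Eq. (12) for an original field X and a
# reconstructed field Y of the same shape.
import numpy as np

def normalized_errors(x, y):
    x, y = np.ravel(x).astype(float), np.ravel(y).astype(float)
    x_range = x.max() - x.min()
    rmse = np.sqrt(np.mean((x - y) ** 2)) / x_range
    nme = np.max(np.abs(x - y)) / x_range
    return rmse, nme
```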

Structural similarity
Structural similarity (SSIM) [41] was used to measure the similarity of a pair of data plots; it is calculated as

$$\mathrm{SSIM}(I_X, I_Y) = \frac{(2\mu_{I_X}\mu_{I_Y} + C_1)(2\sigma_{I_X I_Y} + C_2)}{(\mu_{I_X}^2 + \mu_{I_Y}^2 + C_1)(\sigma_{I_X}^2 + \sigma_{I_Y}^2 + C_2)}$$

where I_X and I_Y are plots of X and Y, respectively, μ_{I_X} and μ_{I_Y} are their means, σ_{I_X}^2 and σ_{I_Y}^2 are their variances, and σ_{I_X I_Y} is their covariance. C_1 and C_2 are constants that prevent the denominator from becoming zero: C_1 = (K_1 L)^2 and C_2 = (K_2 L)^2, with K_1 = 0.01, K_2 = 0.03, and L = 255.
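A direct transcription of the global SSIM formula above (not the windowed variant common in image processing), assuming 8-bit plots of equal size:

```python
# Global SSIM as defined above, for two 8-bit plots I_X and I_Y of the same size.
import numpy as np

def ssim(ix, iy, k1=0.01, k2=0.03, dynamic_range=255.0):
    ix, iy = np.ravel(ix).astype(float), np.ravel(iy).astype(float)
    c1, c2 = (k1 * dynamic_range) ** 2, (k2 * dynamic_range) ** 2
    mu_x, mu_y = ix.mean(), iy.mean()
    var_x, var_y = ix.var(), iy.var()
    cov_xy = np.mean((ix - mu_x) * (iy - mu_y))
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```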

Location representation
The block histogram model [9] and the spatial GMM [20] improve reconstruction accuracy by augmenting the representation with location information. By comparing our method to these two methods, we demonstrate that our proposed method has better reconstruction accuracy for large-scale multi-block scientific data. Figure 9 shows the LF reconstruction results using the block histogram model, the spatial GMM, and our correlation-aware model. The primary drawback of the block histogram model is the loss of the spatial distribution of the original data, which the spatial GMM alleviates. However, both summarizations lose the correlations between different data blocks, which results in apparent discontinuities in the reconstruction results. Our correlation-aware probabilistic summarization eliminates these discontinuities and produces smoother reconstruction results than the existing methods. A statistical analysis of the reconstruction results of the above three probabilistic summarizations is presented in Table 2. The VMDs of the reconstruction results were 6.314×10⁻², 8.486×10⁻³, and 7.102×10⁻⁴, respectively. The RMSEs between the reconstruction results and the original data were 0.372, 0.122, and 0.007, respectively. The data compression ratios R of the three models were 2:1, 42.7:1, and 54.4:1, respectively. The qualitative and quantitative analysis results show that all three models achieve significant data reduction, while the proposed method attains the highest reconstruction accuracy.

Correlation representations
The correlation information essentially includes numerical correlation information and spatial correlation information. The SLIC-based model [18]  implements spatial correlation modeling using data clustering based on inherent spatial consistency.
The entropy-based model [40] implements numerical correlation modeling by computing mutual information. Figure 10 shows the SW reconstruction results using the SLIC-based model, the entropy-based model, and our correlation-aware model. The main drawback of the SLIC-based model is its incompatibility with massively parallel scientific simulation: because the domain decomposition strategy of the parallel simulation does not consider the inherent spatial consistency of the original data, the SLIC-based model must merge the data blocks held by different parallel computing nodes and repartition them, producing a series of new data blocks with low value variation and thereby improving reconstruction accuracy (Fig. 10(b)). However, merging and repartitioning the data blocks has a high computational cost, which leads to significant performance bottlenecks. The entropy-based model has limitations in extracting specific spatial structures embedded in the underlying features and is sensitive to the level of discretization, that is, the number of histogram bins (Fig. 10(c)). The proposed correlation-aware probabilistic summarization processes the multi-block data while keeping the initial data blocks unchanged, and achieves higher reconstruction accuracy than the existing methods (Fig. 10(d)). A statistical analysis of the reconstruction results of the above three probabilistic summarizations, including their VMDs, is presented in Table 3.

Fig. 9 Visual comparison of a volume rendering of (a) the ground truth and the reconstruction results of LF, using (b) the block histogram model [9], (c) the spatial GMM [20], and (d) our method.

Fig. 10 Visual comparison of a section pseudo-color rendering of (a) the ground truth and the reconstruction results of SW, using (b) the SLIC-based model [18], (c) the entropy-based model [40], and (d) our method.

Block size
The size b of the data blocks depends on the domain decomposition strategy of the massively parallel scientific simulation. The applicability of our method to the parallel characteristics of scientific simulations is evaluated by comparing the data in Table 1 at different block sizes. Figure 12 gives histograms of block size versus −lg(RMSE) and block size versus compression ratio R for the reconstruction results obtained with the proposed method on LF, SW, AE, and HI. Four block sizes were used in the experiment. The results show that, for all datasets, the smaller the block size, the smaller the RMSE and the smaller the data compression ratio. Figure 11 shows an isosurface rendering for HI. The result reconstructed from the spatial GMM with b = 8 has apparent numerical discontinuities between adjacent data blocks (see Fig. 11(b)), with R = 8.5:1. The proposed method with b = 16 improves the numerical continuity of the reconstructed data, particularly at block boundaries, with R = 41.6:1 (Fig. 11(c)). The proposed method thus reduces the data storage cost by approximately one order of magnitude compared to the spatial GMM while achieving higher reconstruction accuracy.

Intra-node performance
A single node of Inspur's high-performance server system was used for these experiments. For LF in Fig. 9, 2730 data blocks were allocated to each core of the node. The parallel computation time for Stage I of the proposed method was 42.31 s (33.89 s for single-block data summarization and 8.42 s for multi-block data summarization), and for Stage II it was 13.45 s. For AE in Fig. 2, 10,752 data blocks were allocated to each core. The parallel computation time for Stage I of the proposed method was 254.04 s (155.25 s for single-block data summarization and 98.79 s for multi-block data summarization), and for Stage II it was 51.96 s. Because modeling and reconstruction do not demand real-time performance, these timings are acceptable to users, and increasing the number of parallel cores would shorten them further. Moreover, the proposed method does not require block merging and repartitioning, which significantly reduces the computational cost: its merging and repartitioning timings for LF and AE were only 0.43 s and 1.09 s, respectively.

Performance comparison
The merging and repartitioning timing T_c, modeling timing T_m, reconstruction timing T_r, and total timing T_total of the different methods were recorded to illustrate the speed advantage of the proposed method. The entropy-based model and the proposed correlation-aware model process the original multi-block data directly, so their merging and repartitioning timings are relatively short. The block histogram model and the spatial GMM reclassify the merged data, whereas the SLIC-based model requires iterative clustering, which results in longer merging and repartitioning timings. As shown in Fig. 13, merging and repartitioning accounted for most of the total time of the block histogram model, the spatial GMM, and the SLIC-based model. The proposed method required the shortest total time for LF and AE while producing the most accurate reconstruction results.

Fig. 13 Intra-node performance overhead for (a) LF and (b) AE using the block histogram model [9], the spatial GMM [20], the SLIC-based model [18], the entropy-based model [40], and our correlation-aware model.

Reconstruction accuracy
To verify the effectiveness of the proposed method in processing very large-scale scientific data, two sets of real-world simulation data, AIT and AIV, were used for this experiment. Figures 14 and 15 visualize the reconstruction results of the spatial GMM and the proposed correlation-aware probabilistic summarization. The proposed method significantly improves the numerical continuity of the reconstructed data, particularly at block boundaries, and also achieves efficient data compression for large-scale multi-block scientific data. A statistical analysis of the reconstruction results of the spatial GMM and the proposed method for AIT and AIV is presented in Table 4. The VMDs of the reconstruction results were 8.662×10⁻³ and 9.019×10⁻⁴ for AIT, and 6.201×10⁻³ and 8.197×10⁻⁴ for AIV. The RMSEs between the reconstruction results of the two probabilistic summarizations and the original data were 0.103 and 0.004 for AIT, and 0.094 and 0.004 for AIV. The data compression ratios R of the two models were 21.5:1 and 22.2:1 for AIT, and 10.9:1 and 11.2:1 for AIV. The qualitative and quantitative analysis results show that the reconstruction accuracy of the proposed method is higher, while the data storage cost is reduced by over an order of magnitude compared to the original data.

Inter-node performance
Multiple nodes of the Sugon supercomputer were used for the very large-scale data tests. As the number of nodes increased and the merging and repartitioning timings grew, the advantages of the proposed method became more apparent. Table 5 gives the merging and repartitioning, modeling, reconstruction, and total timings of the spatial GMM, the SLIC-based model, and the proposed model for AIT and AIV using 128, 256, and 512 cores. For all three models, as the number of cores increased, the merging and repartitioning timings increased while the modeling, reconstruction, and total timings decreased. The total timings of the proposed method were much lower than those of the other two methods.

Fig. 14 Visual comparison of (above) a volume rendering and (below) a section pseudo-color rendering of (a) the ground truth and the reconstructed results for AIT, using (b) the spatial GMM [20] and (c) our method.

Fig. 15 Visual comparison of (above) a volume rendering and (below) a section pseudo-color rendering of (a) the ground truth and the reconstructed results for AIV, using (b) the spatial GMM [20] and (c) our method.

Summary and analysis
The test data shown in Table 1 come from five different application domains. Judging by the wide applicability of local statistical data models in the visualization community, there is a growing need for accurate and efficient statistical data summarization techniques. However, existing univariate summarization techniques ignore the correlations between different data blocks, resulting in artifacts and apparent discontinuities at block boundaries during visualization.
In addition, existing methods of multivariate data summarization only model the dependency structures of the variables. They ignore the inherent spatial data coherency and struggle to realize high-precision reconstruction results efficiently when dealing with large-scale multi-block univariate data.
Our method considers the statistical correlation properties of adjacent data blocks to obtain reconstruction results with better continuity and preservation of physical characteristics of the original data. In addition, our method reduces the computational cost; its performance advantage is more significant with the increasing scale of massively parallel scientific simulation. Our method helps users with accurate and efficient visualization and allows them to enhance their understanding of scientific phenomena.
In conclusion, our method realizes data reduction of approximately one order of magnitude and makes the probabilistic data summarization of large-scale multi-block scientific data more efficient. Further, it is widely applicable to typical numerical simulation applications based on structural grid data.

Achievements
We have proposed a correlation-aware probabilistic data summarization method for large-scale multi-block scientific data visualization. Our method significantly improves the accuracy of reconstruction results and reduces visualization uncertainty by statistically modeling the numerical and spatial correlations between adjacent data blocks stored in different parallel computing nodes. Additionally, our method processes the multi-block structural grid data generated by massively parallel scientific simulations while keeping the initial data blocks unchanged, thereby significantly reducing the computational cost. In our experimental tests, we used six sets of scientific data, with a maximum grid size of one billion cells. Quantitative analysis shows that our method reduces the data storage cost by nearly one order of magnitude compared to existing methods. For the same data compression ratio, the accuracy of data reconstruction was improved by up to almost two orders of magnitude.

Multivariate data
In future work, we hope to consider using copula functions to model the correlation between large-scale time-varying multivariate multi-block scientific data, further widening the applicability of our correlationaware probabilistic data summarization approach.

Computational cost
A few limitations exist, which we will also address in future work.
Our method relies heavily on nonlinear optimization, which solves linear systems to generate descent directions and is therefore time-consuming. In future, it would be valuable to develop dedicated parallel solvers for improved speed.

Parameter settings
Several parameters must be set in advance in our method, such as the block size, the number of bins in each histogram, and the maximum number of Gaussian components in each spatial GMM. Changes to parameters have a significant influence on the reconstruction accuracy and computation performance for data with various statistical properties and resolutions. At present, we set these parameters entirely according to practical experience. It would be interesting to investigate setting these parameters automatically using a deep learning framework.

Unstructured and ensemble data
As a mainstream method for understanding data, visualization reveals complex structures in data which cannot be understood in any other way. However, unstructured and ensemble data have no predefined data model and are difficult to standardize, which makes our approach unsuitable for them and difficult to generalize naturally. In future, we will consider probabilistic data summarization for unstructured and ensemble data. In addition to modeling the numerical and spatial distributions of grid samples, we must also model the topological connections between grid samples, and a suitable distribution model should be investigated.