1 Lossy Compression, Storage, and Statistics

1.1 Lossy Compressors

Lossy compressors are increasingly adopted in scientific research to tackle the volumes of data produced by experiments and parallel numerical simulations and to facilitate data storage and movement. Lossy compression can significantly reduce data size without sacrificing data integrity, a pressing concern for many of today’s scientific projects. Lossy compressors typically comprise (i) a decorrelation step that exploits correlations present in the dataset to transform the data into a more compressible form, (ii) an approximation/quantization step that reduces the precision of the input data and introduces the lossiness into the compression pipeline, and (iii) an encoding step that minimizes the number of bits used to represent the quantization outcomes. Different lossy compressors leverage different decorrelation, quantization, and encoding methods. Leading lossy compressors include SZ3 (cubic-spline interpolation-based prediction; Zhao et al. 2021), SPERR (wavelet based; Li and Clyne 2022), ZFP (near-orthogonal transform; Lindstrom 2014), TThresh (singular value decomposition based; Ballester-Ripoll et al. 2020), and Bit Grooming (Zender 2016). In the following, we focus on SZ3 because it provides the best compression quality based on previous studies (Underwood et al. 2022a).
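To make the three-stage pipeline concrete, the following minimal sketch implements a toy bound-respecting compressor: a one-step predictor for decorrelation, uniform quantization under an absolute error bound, and zlib as a stand-in lossless encoder. It only illustrates the structure described above, not the SZ3 algorithm; all function names are ours.

```python
# Toy lossy compressor: (i) prediction, (ii) quantization, (iii) lossless encoding.
# Illustration only; real compressors such as SZ3 use far more elaborate predictors.
import zlib
import numpy as np

def toy_lossy_compress(x, abs_err):
    """Return a compressed byte string guaranteeing a pointwise error <= abs_err."""
    codes = np.empty(len(x), dtype=np.int32)
    prev = 0.0  # predictor state: previously *reconstructed* value
    for i, v in enumerate(x):
        residual = v - prev                               # (i) decorrelation by prediction
        codes[i] = int(round(residual / (2 * abs_err)))   # (ii) quantization into bins
        prev = prev + codes[i] * (2 * abs_err)            # keep encoder/decoder in sync
    return zlib.compress(codes.tobytes())                 # (iii) lossless encoding

def toy_lossy_decompress(blob, abs_err):
    codes = np.frombuffer(zlib.decompress(blob), dtype=np.int32)
    return np.cumsum(codes * (2 * abs_err))

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=10_000))                    # smooth, correlated signal
blob = toy_lossy_compress(x, abs_err=0.1)
x_hat = toy_lossy_decompress(blob, abs_err=0.1)
print("compression ratio :", x.nbytes / len(blob))
print("max absolute error:", np.max(np.abs(x - x_hat)))
```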

1.2 Lossy Compression in Earth System Modeling

Earth system models are particularly relevant to lossy compression because they generate large outputs that exhibit strong correlations. Recent studies have quantified the influence of spatial correlation on lossy compression performance. Klöwer et al. (2021) introduce the concept of bitwise real information (BIR), defined as the mutual information of bits in adjacent grid points; the stronger the association with neighboring bits, the greater the BIR. Krasowska et al. (2021) introduce global and local measures of spatial correlation, obtained via variogram estimation, as explanatory variables for compression ratios (the ratio of the original data size to the compressed data size).
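As an illustration of the variogram-based diagnostic, the sketch below estimates a simple isotropic empirical semivariogram on a gridded field; smooth fields (semivariance growing slowly with lag) are the ones decorrelation steps exploit best and hence tend to compress better. The estimator and the synthetic fields are simplified placeholders, not the exact procedure of Krasowska et al. (2021).

```python
# Hedged sketch: empirical semivariogram along the grid axes of a 2D field.
import numpy as np

def empirical_variogram(field, max_lag=10):
    """Isotropic semivariance gamma(h) along grid axes for lags h = 1..max_lag."""
    gammas = []
    for h in range(1, max_lag + 1):
        dx = field[:, h:] - field[:, :-h]          # differences at lag h along x
        dy = field[h:, :] - field[:-h, :]          # differences at lag h along y
        diffs = np.concatenate([dx.ravel(), dy.ravel()])
        gammas.append(0.5 * np.mean(diffs ** 2))   # semivariance at lag h
    return np.array(gammas)

# Synthetic placeholders: a rough (noise-like) field and a smoother one.
rng = np.random.default_rng(0)
rough = rng.normal(size=(128, 128))
smooth = np.cumsum(np.cumsum(rough, axis=0), axis=1) / 128.0
print("rough field :", empirical_variogram(rough, 3))
print("smooth field:", empirical_variogram(smooth, 3))
```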

In lossy compression, the question of evaluating the quality of reconstructed data naturally brings in statistics. For instance, Baker et al. (2017) address the issue of striking a balance between meaningfully reducing data volume and preserving the integrity of the simulation data via a series of quality assessment metrics: the structural similarity image metric (SSIM/d-SSIM), the p-value of the Kolmogorov–Smirnov test, the coefficient of determination (\(R^2\)) of the Pearson correlation, and the spatial relative error. In their paper, Baker et al. accompany each metric with an acceptable threshold. In a recent work, Underwood et al. (2022a) propose an extensive comparison of 11 lossy compressors for Community Earth System Model (CESM) simulations and suggest the use of different metrics based on probability distributions, including the Wasserstein distance.
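For readers who wish to reproduce such diagnostics, the hedged sketch below computes readily available versions of these metrics with SciPy and scikit-image on a synthetic field and a perturbed reconstruction; the exact preprocessing, derived quantities (e.g., d-SSIM), and thresholds of Baker et al. (2017) are more involved than shown here.

```python
# Hedged sketch of common reconstruction-quality metrics; data are synthetic placeholders.
import numpy as np
from scipy import stats
from skimage.metrics import structural_similarity

rng = np.random.default_rng(1)
original = rng.normal(size=(192, 288))                       # placeholder lat x lon field
reconstructed = original + rng.normal(scale=0.01, size=original.shape)

ssim = structural_similarity(original, reconstructed,
                             data_range=original.max() - original.min())
ks_p = stats.ks_2samp(original.ravel(), reconstructed.ravel()).pvalue
r, _ = stats.pearsonr(original.ravel(), reconstructed.ravel())
rel_err = np.abs(original - reconstructed) / (original.max() - original.min())
wasser = stats.wasserstein_distance(original.ravel(), reconstructed.ravel())

print(f"SSIM={ssim:.4f}  KS p-value={ks_p:.3f}  R^2={r**2:.5f}  "
      f"max spatial rel. error={rel_err.max():.4f}  Wasserstein={wasser:.5f}")
```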

1.3 Numerical Results

In this section, we present results of compression algorithms applied to the original data provided by the authors, and we compare some of the decompressed data with the simulated ensemble proposed by the authors. We perform the analysis in the monthly regional context because it constitutes the largest dataset provided by the authors.

The example data are provided as a 3-dimensional dataset in which the first and third dimensions, representing the regions and the ensemble members, are numerically uncorrelated. This structure would be difficult for most compressors to capture because they are designed to exploit local (typically less than 16 elements away) spatial correlations within the dataset dimensions. When the data are presented to the compressors as 1-dimensional arrays, however, the compressors are better able to capture this local behavior. SZ3, with its dynamic cubic-spline predictor, captures this behavior very accurately at competitive compression ratios. Figure 1 shows the temporal autocorrelation of the monthly average temperature at region #10 (chosen arbitrarily), both for the raw data (as it would be input to a compressor) and for its detrended and deseasonalized counterpart (accounting for 6- and 12-month periods, as typically handled by statisticians). In this figure, we highlight the long correlation ranges present in both the raw and the detrended and deseasonalized data, as well as the periodicities (6- and 12-month periods), which vastly exceed what most compressors can observe and exploit to increase compression ratios. To find a compressor configuration that comes closest to matching the compression ratio of the proposed statistical model, we use OptZConfig, developed by Underwood et al. (2022b), which finds an error bound achieving a given compression ratio. In our case, the absolute error bound is sought between 1 and 100. SZ3 was able to match the compression ratio achieved by the authors within 1% by using an absolute error bound of 12.9.
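When the compression ratio is monotone in the error bound, this search can be emulated with a simple bisection, as in the hedged sketch below; OptZConfig is considerably more general, and `compress_fn` is a placeholder for any bound-parameterized compressor (for instance, SZ3 through its Python bindings).

```python
# Hedged sketch of an error-bound search targeting a given compression ratio.
# `compress_fn(data, abs_err)` is a placeholder returning compressed bytes.

def find_error_bound(data, compress_fn, target_ratio, lo=1.0, hi=100.0,
                     tol=0.01, max_iter=50):
    """Return (bound, ratio) such that the achieved compression ratio is within
    `tol` (relative) of `target_ratio`, assuming the ratio increases
    monotonically with the absolute error bound on [lo, hi]."""
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        ratio = data.nbytes / len(compress_fn(data, abs_err=mid))
        if abs(ratio - target_ratio) / target_ratio < tol:
            return mid, ratio
        if ratio < target_ratio:
            lo = mid        # bound too strict: loosen it
        else:
            hi = mid        # bound too loose: tighten it
    return mid, ratio

# Hypothetical usage with the toy compressor sketched in Sect. 1.1:
# bound, ratio = find_error_bound(x, toy_lossy_compress, target_ratio=40.0)
```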

Fig. 1

Left: Time series of each monthly and regional signal for 4 consecutive years of the study (2006-2010). Solid black line: original data; solid red lines: stochastic samples proposed by the authors; and dots: reconstructed data from compression with SZ3 (green: absolute error bound of \(10^{-1}\), grey: absolute error bound of 1). Central and right: temporal autocorrelation of monthly regional temperature at region #10, respectively, for the raw data (typically input in a compressor) and detrended and deseasonalized data (typically operated by statisticians) (Color figure online)

The left panel of Fig. 1 shows the monthly regional (region #10) temperature together with its counterparts from the stochastic model proposed by the authors and from a reconstruction after compression with SZ3. We note that the overall trends, periodicities, and ensemble spread are captured by most methods. We also observe the trade-off between compression techniques and stochastic simulations. Stochastic simulations reproduce the statistical behavior of the reference data and can generate as many samples as desired. Compression methods, on the other hand, capture the pointwise behavior of each ensemble member and provide a tool applicable in a general context (i.e., to other variables); however, they cannot generate new scenarios. We also illustrate the varying quality of reconstructed data from compression by showing two different absolute error bounds, \(10^{-1}\) and 1. As the error bound becomes more permissive, the accuracy of the reconstruction decreases, but the compression ratio increases. In the left panel of Fig. 2, we compare the distributions of the original data, the stochastic simulations from the authors, and the decompressed data. We note that the simulation and compression methods both perform well but do not capture the same parts of the distribution. The right panel of Fig. 2 shows the distributions of the Wasserstein distance computed between the reference data and the simulated and decompressed data, following Underwood et al. (2022a). Each boxplot data point represents the Wasserstein distance computed for one region. Stochastic samples and reconstructions from SZ3 at the \(10^{-1}\) and 1 error levels exhibit a low median Wasserstein distance. However, the most permissive compression, with tolerated errors of 12.9 (corresponding to the compression ratio matching the authors’), does not capture the distribution of the reference data well. As pointed out by the authors, the stochastic samples may not recover the highest points of the distributions. This situation is common with statistical models and can explain the variability in the corresponding Wasserstein distances.
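The per-region comparison in the right panel of Fig. 2 can be reproduced along the lines of the following sketch, where array names, shapes, and the synthetic surrogate are placeholders; each returned value corresponds to one point behind a boxplot.

```python
# Hedged sketch: one Wasserstein distance per region between a reference ensemble
# and a surrogate (stochastic samples or decompressed data).
import numpy as np
from scipy.stats import wasserstein_distance

def regional_wasserstein(reference, surrogate):
    """reference, surrogate: arrays of shape (n_regions, n_months * n_members).
    Returns one distance per region, i.e., the points behind each boxplot."""
    return np.array([wasserstein_distance(reference[r], surrogate[r])
                     for r in range(reference.shape[0])])

rng = np.random.default_rng(2)
reference = rng.normal(size=(20, 1200))                 # placeholder ensemble
surrogate = reference + rng.normal(scale=0.1, size=reference.shape)
dists = regional_wasserstein(reference, surrogate)
print("median Wasserstein distance over regions:", np.median(dists))
```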

Fig. 2

Left: histograms of the original monthly temperature data (region #10), corresponding stochastic samples from the authors, and decompressed data from SZ3. Right: Wasserstein distance between the original data and, from left to right, stochastic samples from the authors and SZ3 reconstruction with error = 0.1, 1, and 12.9

We have highlighted some differences between the two approaches. Since each technique has its own advantages and drawbacks, these differences could be leveraged as synergies in future work, depending on whether the user’s priority is storage, time performance, or generative capability.


2 Stochastic Emulators

2.1 Proposed Stochastic Generators

As noted by the authors, the stochastic simulation of different physical fields can represent a challenge (Ailliot et al. 2015a). Wind fields, which comprise wind intensity and direction and therefore necessitate a circular treatment, require regime switching because both variables are linked to weather types (Ailliot and Monbet 2012; Ailliot et al. 2015b). Precipitation is another challenging variable since it combines an occurrence variable and a rainfall amount variable (Katz 1977; Thompson et al. 2007; Kleiber et al. 2012), as sketched below. Additionally, to mimic operational settings, one would need to simulate multiple variables concurrently, which remains a challenging aspect of statistical modeling. Richardson (1981) and Parlange and Katz (2000) pioneered single-site models for multiple variables; however, the addition of a spatial component has rarely been attempted since then.
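To illustrate the two-part structure of precipitation mentioned above, the following sketch combines a first-order Markov chain for wet/dry occurrence with gamma-distributed wet-day amounts, in the spirit of Katz (1977) and Richardson (1981); all parameter values are illustrative placeholders rather than fitted quantities.

```python
# Hedged sketch of a two-part precipitation generator (occurrence + intensity).
import numpy as np

def simulate_precip(n_days, p_wet_given_dry=0.3, p_wet_given_wet=0.6,
                    shape=0.8, scale=5.0, seed=0):
    rng = np.random.default_rng(seed)
    wet = np.zeros(n_days, dtype=bool)
    amounts = np.zeros(n_days)
    for t in range(1, n_days):
        p = p_wet_given_wet if wet[t - 1] else p_wet_given_dry
        wet[t] = rng.random() < p                      # occurrence process (Markov chain)
        if wet[t]:
            amounts[t] = rng.gamma(shape, scale)       # intensity process (mm, illustrative)
    return amounts

precip = simulate_precip(365)
print("wet-day fraction:", np.mean(precip > 0),
      "mean wet-day amount:", precip[precip > 0].mean())
```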

The authors mention the difficulty of modeling both the bulk and the tails of a distribution. Recent developments in that vein have been proposed for unidimensional probability distribution functions (PDFs); see Tencaliec et al. (2020) and Stein (2021a, b). These models seek to capture low, moderate, and high values of a variable of interest within a single PDF. However, they require additional layers of modeling to account for spatial and temporal dependencies and their nonstationarities.
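One simple way to combine bulk and tail in a single density is the classical threshold-splicing construction sketched below, with a gamma bulk and a generalized Pareto tail; the extended-GPD models of Tencaliec et al. (2020) and Stein (2021a, b) are smoother and more principled, and all parameter values here are illustrative only.

```python
# Hedged sketch: gamma bulk spliced with a generalized Pareto tail above a threshold u.
import numpy as np
from scipy.stats import gamma, genpareto

def spliced_pdf(x, a=2.0, scale_bulk=3.0, u=15.0, xi=0.2, scale_tail=4.0):
    """Valid density: gamma below u, GPD above u weighted by the gamma tail mass."""
    x = np.asarray(x, dtype=float)
    bulk = gamma.pdf(x, a, scale=scale_bulk)
    tail = (1 - gamma.cdf(u, a, scale=scale_bulk)) * \
           genpareto.pdf(x - u, xi, scale=scale_tail)
    return np.where(x <= u, bulk, tail)

x = np.linspace(0, 60, 601)
dens = spliced_pdf(x)
print("integrates to ~1:", float(np.sum(dens) * (x[1] - x[0])))
```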

The study of the minimal number of training samples from the CESM-LENS is crucial and interesting. We expect that this number of training samples would increase if one were to consider other variables, such as wind or precipitation, or finer temporal scales, such as subdaily scales. Determining a reduced number of ensemble members that are representative of the full ensemble could also be a way to save storage, but it requires defining the concept of “representativeness” statistically and mathematically.
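One possible, hedged formalization of “representativeness” is sketched below: summarize each ensemble member, cluster the summaries with k-means, and retain the member closest to each centroid. This is only one of many reasonable definitions, and the data are synthetic placeholders.

```python
# Hedged sketch: select representative ensemble members via k-means on member summaries.
import numpy as np
from scipy.cluster.vq import kmeans2

def representative_members(ensemble, k=5, seed=0):
    """ensemble: array (n_members, n_features), e.g., flattened regional time series.
    Returns indices of up to k members, one nearest to each k-means centroid."""
    centroids, labels = kmeans2(ensemble.astype(float), k, seed=seed, minit='++')
    chosen = []
    for c in range(k):
        members = np.flatnonzero(labels == c)
        if members.size:                      # skip empty clusters
            d = np.linalg.norm(ensemble[members] - centroids[c], axis=1)
            chosen.append(members[np.argmin(d)])
    return np.array(chosen)

rng = np.random.default_rng(3)
ensemble = rng.normal(size=(40, 600))         # placeholder: 40 members, 600 features
print("representative member indices:", representative_members(ensemble, k=5))
```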


2.2 Next Generation of Weather and Climate Models

As discussed by the authors, the need to rethink climate model simulations is critical as increasing computational power and high-resolution modeling lead to larger and larger outputs. As an example discussed by Klöwer et al. (2021), the European Centre for Medium-Range Weather Forecasts produces 230 TB of data on a typical day, and this data production is expected to quadruple within the next decade because of the increased spatial resolution of the forecast model (Bauer et al. 2020). In that vein, recent works have studied the use of single and mixed precision in climate modeling in order to handle large amounts of data while ensuring forecast quality (Váňa et al. 2017; Tintó Prims et al. 2019). Meanwhile, in a world of changing climate, Loft (2020) pleads for greener Earth system modeling that reduces the carbon footprint of weather and climate models; he highlights the need to “develop machine learning algorithms to avoid unnecessary computations.” This statement resonates with the current paper and with further applications of stochastic emulators, particularly since the authors mention that statistical models are especially suited for fine spatiotemporal scales.
