WaveRange: wavelet-based data compression for three-dimensional numerical simulations on regular grids

A wavelet-based method for compression of three-dimensional simulation data is presented and its software framework is described. It uses wavelet decomposition and subsequent range coding with quantization suitable for floating-point data. The effectiveness of this method is demonstrated by applying it to example numerical tests, ranging from idealized configurations to realistic global-scale simulations. The novelty of this study is in its focus on assessing the impact of compression on post-processing and restart of numerical simulations.

• Nature of problem: WaveRange compresses three-dimensional floatingpoint data produced by computational physics solvers. It reads the input data stored in files as three-dimensional floating-point arrays, and generates smaller compressed files. Subsequently, WaveRange reads compressed files and approximately reconstructs the original data approximately with controlled accuracy.
• Solution method: Lossy data compression is achieved by application of a three-dimensional discrete wavelet transform, quantization and range coding. The quantization method is designed such as to ensure the required reconstruction error. Reconstruction is the inverse of the compression steps in the reverse order.
• Additional comments including Restrictions and Unusual features: Wa-veRange can be used as a standalone application or as a library. Currently supported data formats in the application mode are 'generic' Fortran/C/C++, FluSI and MSSG output and restart files.
• References: All appropriate methodological references are contained in the section entitled References.

Introduction
Partial differential equations often arise in physical sciences from threedimensional (3D) continuum models, yielding boundary value problems for continuous field variables defined over 3D spatial domains. Numerical solution of these problems involves discretization and, among all available methods, many employ regular grids such that the discretized field variables can be stored as three-dimensional arrays. Regular grids prevail in the highperformance computing (HPC) for enabling fast and efficient implementation of high-order numerical methods with good parallel scalability. HPC simulations based on Cartesian grids are extremely diverse and include, to name a few, particle-laiden fluid flows [1], solar flares [2] and neutron transport inside the core of a nuclear reactor [3]. In Earth science, curvilinear grids such as Yin-Yang [4] or Cubed Sphere [5] are used for global weather simulations [6], geodynamo simulations [7], calculation of seismograms [8], etc. Toroidal grids are used in tokamak plasma simulations [9].
HPC produces large volumes of output data. A numerical simulation using several hundred or thousands of processor cores would allocate threedimensional arrays totaling to several Gigabytes or Terabytes. If the solution evolves in time, a new three-dimensional data set is produced at every time step. This massive data flow is characteristic of big data applications [10], and it is not surprising that high-performance numerical simulations using regular grids hit the limitations of the contemporary data handling technologies. In particular, data storage capacity is finite. To alleviate this constraint in practice, in situ data reduction is routinely performed during the simulation and only the selected integral physical quantities or time-resolved sequences are stored. Nevertheless, it is often required to store the three-dimensional fields for purposes such as scientific visualization, restart of the simulation or additional post-processing. These large datasets quickly saturate the available disk space if stored as floating-point arrays without compression.
Lossless data compression tools, such as LZMA compression, reduce the typical floating-point binary data file size by less than 20%. Accepting some data loss, it is a common practice to store the simulation output fields in single precision and downsample the data, e.g., save every second point value in each direction. Such reduced datasets, while being of insufficient information capacity for the simulation, are often suitable for postprocessing. Given that the differences between neighboring point values can be interpreted as wavelet coefficients, a more refined version of the mentioned approach is to apply wavelet transform to the field and encode the significant portion of the wavelet coefficients using a common data compression method such as entropy coding. This technique is currently widely in use for image compression, being part of the JPEG2000 standard [11]. Its suitability for the computational fluid dynamics (CFD) data compression has been evaluated in [12], alongside other image compression algorithms, and later revisited in [13]. A related method has been recently implemented for multiresolution rendering and storage of geoscience models with discontinuities [14]. In [15,16,17], wavelet compression has been used in the context of numerical simulation of industrial fluid flows using the building-cube method, with focus on aeroacoustics. Overall good performance has been reported in terms of the compression ratio, accuracy and parallel performance for large datasets. However, error control was not explicitly handed.
Of all types of output, the restart data may pose the most stringent accuracy constraint on the lossy compression, because it is commonly expected that restart should not influence the final result of the simulation. One can expect that the required restart data accuracy depend on the physical model. Indeed, there exist models that are insensitive to ample reduction in the width of the floating-point significand [18]. The use of lossy data compression techniques has also been advocated by showing that compression effects are often unimportant or disappear in post-processing analyses [19], and substantial gain in the compression ratio can be achieved while keeping the error at acceptable level in terms of physically motivated metrics [20].
Consequently, it appears reasonable to adjust the restart data storage to the precision justified by the level of model error. We further investigate into this issue by considering two atmospheric dynamics simulations in the present paper.
In the context of fluid dynamics and atmospheric science, since the ability of wavelets to provide compressed representation of turbulent flows was recognized [21], a significant body of research focused on the development of wavelet-based adaptive numerical methods allowing to lower the computational complexity and memory requirements of high-Reynolds number flows simulations [22]. Studies taking the perspective of CFD data storage remain relatively sparse. Besides the aforementioned work, a wavelet transformvector quantization compression method for ocean models was proposed in [23], the effect of lossy wavelet-based compression on barotropic turbulence simulation data has been studied in [24], a hybrid method with supercompact multiwavelets was suggested in [25], tradeoffs in accuracy, storage cost and execution times using different wavelet transforms were considered in [26].
It should be mentioned that wavelet bases are not the only that yield sparse representation of turbulent flow fields. Decompositions such as POD [27,28] or DMD [29] are also used for this purpose, and employed in CFD output data compression methods [30,31]. Each method has its own advantages, but in this paper we only consider the wavelet-based approach that may be more suitable for large datasets for its lower computational complexity, compared with the POD or DMD. A comparison of dimensionality reduction using POD and wavelet coherent structure identification can be found in [32]. The wavelet-based method presented in this work does not require any time history, i.e., it can be applied to compress a single time snapshot.
Sub-band coding (SBC) [33] and the use of more general filter banks than the discrete wavelet [34] have been considered in the context of seismic data compression. Another, conceptually different family of methods can be described as prediction-based compression algorithms (see [35,36,37] and references therein) that exploit spatio-temporal patterns in the data. A comparative discussion of different existing approaches to scientific data reduction, including lossy data compression, can be found in a recent review paper by Li et al. [38].
The first objective of this work is to implement a data compression method suitable for files that contain three-dimensional floating-point arrays output from numerical simulation. We mainly target applications in Earth science such as atmospheric dynamics simulation, but the method and the software are designed to fit broader use. Our approach is similar to [16] con-ceptually, but differs in many aspects such as error control, wavelet transform depth, etc. Therefore, a self-contained description of the method is provided in Section 2. The computer code is implemented in C/C++, it is open-source and accessible via https://github.com/pseudospectators/WaveRange. It is described in Section 3. Our second objective is to evaluate the performance of the method and to devise practical recommendations for users. This constitutes Section 4 of the paper. We particularly focus on the relationship between reconstruction error and compression ratio, as well as its effect on the accuracy of post-processing and simulation restart. The compression and decompression performance with consideration of computational cost is examined in Section 5. Section 6 contains concluding remarks.

Problem definition and description of the method
We restrict our attention to data sampled on single or multi-block grids with each block using three-dimensional Cartesian indexing, as shown in Fig. 1. Numerical methods that involve such kinds of topology are common in HPC for the ease and efficiency of data management. Atmospheric flow simulations of the global scale can be performed using a Yin-Yang grid that consists of two overlapping blocks. Regional and urban simulations can incorporate geometrical representation of landscape features and buildings by using immersed boundary approaches [39,40] that effectively reduce the computational domain to a rectangular box.
Let {f ix,iy,iz } be a three-dimensional scalar field sampled on a grid {x ix,iy,iz , y ix,iy,iz , z ix,iy,iz }, where i x = 1, ..., n x ; i y = 1, ..., n y ; i z = 1, ..., n z . This grid may be curvilinear. However, the positions of the grid nodes in the physical space are not required by the data compression algorithm. Only the array of values f ix,iy,iz and the number of grid points in each direction n x , n y and n z constitute the input data. If the grid is multiblock, each block can either be treated independently by the compression algorithm, or the blocks can be merged in one array if their dimensions match. Therefore, in the following discussion we will refer to the rectilinear grid schematically shown in Fig. 1, without loss of generality.
On the highest level, the compression method consists of the following steps: (I) wavelet transform; (II) quantization; (III) entropy coding. Reconstruction is the inverse of the above operations in the reverse order. These steps are schematically shown in the data flow diagram in Fig. 2. Note that Figure 1: Schematic drawing of a Yin-Yang grid [4] and a Cubed Sphere grid [5] used in Earth science computations, and a Cartesian grid block with point-value data stored at the grid nodes.
'dequantization' is not the exact inverse of the quantization, i.e., F =F in general.

Wavelet transform
Wavelet transform can provide compressed representation of a signal if it is locally correlated, and it becomes particularly efficient if the fine-scale activity is sparse. For example, turbulent flow fields satisfy these criteria, and it is known that only few wavelet coefficients are sufficient to represent the dynamically active part of the flow [22].
The wavelet transform produces an array of real values {F ix,iy,iz } with the same number of elements as in the original array {f ix,iy,iz }, i.e., containing n x × n y × n z elements in total. We use a three-dimensional multiresolution transform based on the bi-orthogonal Cohen-Daubechies-Feauveau 9/7 (CDF9/7) wavelet, expressed in terms of lifting steps [41,42]. Numerical experiments with sample CFD velocity data confirmed that this transform yields higher compression ratio than using lower order biorthogonal wavelets or orthogonal wavelets such as Haar, Daubechies (D4...D20) and Symlets (Sym4...Sym10). Previous work [26] also showed that the CDF9/7 wavelet performed better than the Haar wavelet for scientific data compression, albeit using thresholding of the wavelet coefficients rather than entropy coding.
Basically, the transform consists of a finite sequence of filtering steps, called lifting steps, applied to one-dimensional (1D) signal. Let g i , i = 1, ..., m be a 1D array of real values. One level of the forward transform consists in calculating the values of approximation coefficients s j , j = 1, ..., ⌈m/2⌉ and detail coefficients d j , j = 1, ..., ⌊m/2⌋ (⌈·⌉ and ⌊·⌋ are the ceiling and the floor functions, respectively) using lifting steps where α = −1.5861343420693648, β = −0.0529801185718856, γ = 0.8829110755411875, δ = 0.4435068520511142 and ζ = 1.1496043988602418. For boundary handling, coefficients s 0 and d (2) 0 are defined and set identical to zero. If m is odd, then the missing element d (0) ⌈m/2⌉ , which is necessary for calculating d (1) ⌈m/2⌉ and s (1) ⌈m/2⌉ subsequently, is determined using extrapolation The output of (1) is an array of size m in which the first ⌈m/2⌉ elements contain the approximation coefficients s j and the last ⌊m/2⌋ elements contain the detail coefficients d j .
The inverse transform admits coefficients s j and d j at input, and resolves the lifting steps (1) in the reverse order to produce g j at the output. More specifically, s and so on. The three-dimensional transform is constructed by applying the above one-dimensional transform sequentially in the three directions of the threedimensional data array. WaveRange does one level of the 1D wavelet transform in the x direction, then in y, and finally in z. After that, it moves to the next level: the approximation coefficients on the first level are taken as input data for 1D transforms. Then, the same procedure is repeated on the second level, etc., as explained in Algorithm 1. We found that repeating the process 4 times is practically sufficient to reach the maximum compression ratio, because, after 4 levels of the transform, the number of approximation coefficients becomes as small as 1/4096 of the total number of elements in the dataset. We therefore always set L = 4 in WaveRange. This means that, as long as n x ≥ 15, n y ≥ 15 and n z ≥ 15, the same number (four) of 1D transform levels is realized in all three directions, even in those cases when there is a dramatic difference between n x , n y and n z . Note that the transform is in-place, i.e., {f ix,iy,iz } and {F ix,iy,iz } physically share the same memory by being stored in the same array.
The inverse transform has similar algorithmic structure. It starts with the approximation coefficients and the details at the largest scale, and repeats L iterations adding one extra level of details in each direction on each iteration.
An example field and its transform are displayed in figure 3. On the left, the original field in physical space, {f ix,iy,iz }, is displayed. In this example, it contains point values of the velocity component u x in a turbulent wake flow, which is described in Section Appendix A. The magnitude of the point data values is shown on a logarithmic color scale. On the right, the output of Algorithm 1 is shown, in which the approximation coefficients and the detail coefficients on all levels are packed in one three-dimensional array Data: 3D array of point values in physical space f ix,iy,iz , i x = 1, ..., n x , i y = 1, ..., n y , i z = 1, ..., n z . Result: approximation coefficients and detail coefficients placed in the same array F ix,iy,iz . m x ← n x ; m y ← n y ; m z ← n z ; if m x > 1 then for i y = 1 to m y ; i z = 1 to m z do copy data elements F jx,iy,iz in a contiguous array g jx , where j x = 1, ..., m x ; apply the 1D wavelet transform (1) and copy the result in F jx,iy,iz ; end end if m y > 1 then for i x = 1 to m x ; i z = 1 to m z do copy data elements F ix,jy,iz in a contiguous array g jy , where j y = 1, ..., m y ; apply the 1D wavelet transform (1) and copy the result in F ix,jy,iz ; end end if m z > 1 then for i x = 1 to m x ; i y = 1 to m y do copy data elements F ix,iy,jz in a contiguous array g jz , where j z = 1, ..., m z ; apply the 1D wavelet transform (1) and copy the result in F ix,iy,jz ; end end m x ← ⌈m x /2⌉; m y ← ⌈m y /2⌉; m z ← ⌈m z /2⌉; end Algorithm 1: Three-dimensional wavelet transform.
|f|, |F| data in physical space approximation coefficients detail coefficients Figure 3: Example visualization of a 3D dataset and its wavelet transform.
{F ix,iy,iz }. Seven-eighth of its elements correspond to the smallest-scale detail coefficients. They are all small in magnitude: most of them are below the visibility threshold of the selected color scale, and only few large ones appear in the boundary layer. On the next and subsequent levels, details in the turbulent wake become increasingly larger in magnitude. The approximation coefficients occupy the upper-left corner of the domain. They are small in number and large in magnitude. Note that the transform is near lossless (in the terminology of [38]) such that all values of {f ix,iy,iz } can be calculated from {F ix,iy,iz } with floating point round-off accuracy.

Quantization
From this point on, {F ix,iy,iz } is treated as a one-dimensional array of real values F i , i = 1, ..., N, where N = n x × n y × n z . Quantization represents each element F i as a set of 1-byte numbers (i.e., integer numbers ranging from 0 to 255), as required for the subsequent entropy coding. In general, entropy coders are not limited to 256 size alphabet, but this size is convenient as it corresponds to the char type, native in C language. A double-precision floating point variable F i can be losslessly quantized using eight 1-byte numbers. However, in applications related to the numerical simulation of turbulent flows, quantization with some data loss has to be accepted in order to achieve the desired high compression ratio.
Lossy compression requires less than eight 1-byte integer numbers Q j i per variable F i , indexed with a superscript j , using the following approxi-mation: with the approximation error no greater than δ J . The latter is controlled by the relative tolerance ε, which is a user-specified parameter of the compression routine. Note that ε can be regarded as the desired L ∞ error in f in physical space, normalized with max |f |, but δ J in (3) sets the absolute error in F i in wavelet space. To relate δ J with ε, we define where the maximum is taken over all elements, and η is a constant coefficient. By trial and error we have found that the value η = 1.75 guarantees that ||f −f || ∞ / max |f | ≈ ε as long as we use four levels of the wavelet transform. Quantization adds random noise with amplitude ε F to the wavelet coefficients F , i.e., max |F −F | = ε F . The pointwise error in the physical space, f −f , is a weighted sum of F −F , with the weight determined by the lifting coefficients and by the length of the filter. From this consideration, it may be possible to evaluate η analytically, but the empirical value of 1.75 proves acceptable in all test cases considered in this paper, see Appendix B. We then assign δ J = ε F . The values of Q j i , δ j , F j min and J are determined as follows from Algorithm 2, where we use square brackets to denote the nearest integer. An example graphical illustration of this algorithm is presented in figure 4. In the example, J = 3 bit planes are required to represent the floating-point Algorithm 2: Quantization algorithm.

Entropy coding
Entropy coding is a technique allowing to reduce the quantity of storage required to hold a message without any loss of information. The message can be any sequence of characters drawn from some selected alphabet. The implementation that we use in the present work is based on an alphabet of 256-bit symbols, i.e., integers from 0 to 255. Conceptually, entropy coding consists in representing frequently occurring sub-sequences with few bits and rarely occurring ones with many bits. Shannon's coding theorem serves the entropy of a message as a theoretical bound to possible lossless compression [43].
The entropy coding method that we employ in our present work is the range coding [44]. Our current implementation uses an open-source coder rngcod13 developed by M. Schindler [45]. The size of the encoded files produced by this coder is within a fraction of per cent from the theoretical bound, and the operation speed is faster compared to other similar methods such as arithmetic coding.
The input message consists of the bit planes Q j i obtained during the quantization step. These data arrays are divided in blocks of 60,000 elements. The frequencies of each value in a block are counted and copied to the output stream of the coder. The data block is then encoded using the calculated frequencies. Given a stream of 256-bit symbols Q j i and their frequencies, the coder produces a shorter stream of bits to represent these symbols. The output stream is written in a file, appended with metadata necessary for decoding.

Software framework
This section contains an overview of the software. Installation instructions and user's notes can be found in https://github.com/pseudospectators/WaveRange/blob/master/README.md

Software functionalities
WaveRange application includes an encoder that compresses the floating point data and a decoder that reconstructs the data from the compressed format. Two different executables are built for these two purposes, respectively, when WaveRange is used as a standalone application. WaveRange's 'generic' interface can read and write plain Fortran and C/C++ files containing one or several multi-dimensional arrays. In addition, the current version of Wav-eRange includes 'specialized' input and output interfaces compatible with a spectral incompressible Navier-Stokes solver FluSI [46] and with MSSG, the Multi-Scale Simulator for the Geoenvironment [47]. Each of these two solvers generate two types of large files: regular output for, e.g., flow visualization, and restart files. WaveRange can compress both. The input floating point data for compression may be stored in one file or divided in multiple files, in a format specific to the computation software employed.
WaveRange can also be used as a library. In that case, the encoder and the decoder functions are called from the user's program. The original, the encoded and the reconstructed datasets are passed as parameters. Disk input/output is to be implemented by the user.

Software architecture
WaveRange is written in C and C++. The lower-level functions related with the discrete wavelet transform and range coding are in C. Their source codes are stored in the directories waveletcdf97_3d/ and rangecod/, respectively. On a higher level, the C++ code in core/ implements the functions void encoding_wrap and void decoding_wrap that execute all operations necessary for compression and reconstruction, respectively, in accordance with the data flow diagram in Fig. 2. These functions are called after the input data files are read and before the output files are written. Two similar functions, void encoding_wrap_f and void decoding_wrap_f, offer compatibility with Fortran. The int main functions are specific to each interface. The respective C++ source files are contained in the directories generic/, flusi/ and mssg/, the executables are generated in bin/generic/, bin/flusi/ and bin/mssg/. Note that the FluSI interface requires the hierarchical data format (HDF5) library [48], but the generic and MSSG interfaces have no external dependencies. The WaveRange library files appear in bin/lib/.

Illustrative examples
In the subsequent sections of this paper, we discuss numerical examples that serve to evaluate the performance of the method and to gain better understanding of different properties of the compressed data. A homogeneous isotropic turbulence dataset, which can be regarded as a highly idealized representation of atmospheric flow, is examined in Section 4.1. Quantitative measures such as the error norm, the relative compressed file size and the compression ratio are introduced, their scaling with respect to the tolerance and the data size is analyzed. A similar analysis with application to seismology is discussed in Section 4.2. After gaining insight into the basic typical features of the compressed data, the discussion proceeds to realistic atmospheric flow simulations. In Section 4.3, a global weather simulation for typhoon prediction is considered, with special attention to restart of the simulation from a compressed restart data file. Restarts using different levels of compression tolerance are tested. Restarts of an urban-scale weather simulation are discussed in Section 4.4.

Fluid turbulence simulation
The homogeneous isotropic turbulence (HIT) dataset considered here is similar to the velocity fields analyzed in [49]. It was obtained by integrating the incompressible Navier-Stokes equations in a 2π-periodic cube box, in dimensionless units, using a fourth-order finite-difference scheme, second-order Runge-Kutta time marching, and HSMAC velocity-pressure coupling [50]. The flow domain was discretized using a uniform Cartesian grid consisting of n 3 = 512 3 staggered grid points. Power input was supplied using external isotropic forcing to achieve a statistically-stationary state. A snapshot velocity field (u,v,w) was stored in double precision, each component occupying 1 GB of disk space.
The dataset used in the present study is visualized in Fig. 5(a) using an iso-surface of the vorticity magnitude, and the energy spectrum is shown in Fig. 5 respectively. The turbulence kinetic energy is equal to K = 1.44. With the dissipation rate ǫ = 0.262, the Kolmogorov length scale is equal to η = (ν 3 /ǫ) 1/4 = 0.00845, such that k max η = 2.16, where k max = n/2. The Taylor micro scale λ = 10νK/ǫ = 0.246 yields the Reynolds number Re λ = 2K/3λ/ν = 218. Homogeneous isotropic turbulence presents a challenge for the data compression. Turbulent flow fills the entire domain with fluctuations at all scales, the smallest scale being of the same order as the discretization grid step (see figure 5b).
Before analyzing the compression performance in this case, let us discuss the error control properties. Among the variety of physically motivated error metrics we choose L ∞ , which is intuitive and perhaps the most stringent. Other, problem specific metrics will be discussed later on. Figure 6(a) shows the relative L ∞ error of the velocity field reconstructed from compressed data. The compression algorithm takes the tolerance ε as a control parameter. This allows to plot the reconstruction error as a function of ε. Each velocity component is scaled by its maximum absolute value, yielding where f stands for one of the components (u, v or w) of the original velocity field, andf is its reconstruction from the compressed data. The maxima and the minima are calculated over all grid points. In figure 6(a), the dash-dot diagonal line visualizes the identity relationship between ε and e ∞ . The actual data for all three velocity components closely follows this trend, with the discrepancy only becoming noticeable when ε is smaller than the accumulated roundoff error, and for large ε when the discrete nature of quantization becomes apparent as there remain only few non-zero detail coefficients. For all ε ∈ [10 −14 , 10 −3 ], the difference between ε and e ∞ is less than 15%, i.e., the desired error control is successfully achieved. Figure 6(b) shows a plot of the relative compressed file size versus e ∞ , where s(f ) = 1 GB is the disk space required to store a 512 3 array of double-precision values, and s(ϕ) is the respective compressed file size in GB. The dash-dot diagonal line in this plot represents the storage requirement for a 512 3 array of hypothetical custom-precision values, which is 100% in the case of double precision, 50% for the single precision, 0 for the full loss of information, and all intermediate values are linearly interpolated. This line sets a reference it terms of the compression achievable by simply quantizing the floating-point array elements with tolerance ε and using smaller, but constant number of bits per element. Its slope is equal to 6.39% of storage per one decimal order of magnitude of accuracy. The point markers connected by solid lines show the amount disk space actually used by the compressed velocity fields files, and the respective reconstruction error. When the accuracy is in the range between 10 −14 and 10 −3 , it is well approximated with a fitting Σ HIT = (−0.12 − 0.052 log 10 e ∞ ) × 100%, which is shown with a dotted magenta line. The negative intercept can be explained as follows. When ε (and, consequently, e ∞ ) is large, only a few significant bits are sufficient to represent the field (u, v, w) with the desired accuracy. Hence, the wavelet transform after quantization becomes sparse, and it is very efficiently compressed by entropy coding. In the intermediate range of ε, increasing accuracy by one order of magnitude comes at a cost of 5.2% increased storage. Least significant bit planes of the wavelet coefficients are not sparse, they are noise-like, and their lossless compression ratio by entropy coding asymptotically tends to a fairly low value typical of noise, 0.0639/0.052 ≈ 1.23, in the hypothetical limit of ε → 0. However, for ε < 10 −14 , accumulation of roundoff errors becomes significant, Σ sharply increases to 76%, while e ∞ saturates at 5 × 10 −15 . Roundoff error could be avoided by switching to integer wavelet transform ensuring perfect reconstruction. Extrapolation of the linear trend suggests that this approach may be superior than, for example, direct application of LZMA encoding, as shown by crosses in figure 6(b). It is also noteworthy that compression with ε = 10 −8 allows to reduce the data storage by a factor of 3, which is significantly better than using the native single-precision floating-point format.
Compression with ε = 10 −6 increases this ratio to 5. On the other hand, from the figure it is apparent that multi-fold reduction of the file size, which is our objective, entails data loss. Compression ratio is a commonly used performance metrics, particularly suitable in those situations when the compressed file size is many times smaller than the original size. Thus, Figure 6(c) reveals that hundredfold compression is attainable by setting ε = 10 −2 . This may be a good setting for the purpose of qualitative flow visualization, for example, as illustrated in figure 5(a): vortex filaments consist of many small cyan and magenta patches, which means that two iso-surfaces coincide almost perfectly, namely, the cyan iso-surface that visualizes the vorticity magnitude calculated from the original velocity, and the magenta iso-surface that is obtained using the velocity field reconstructed from the compressed data with ε = 10 −2 . Figure 5(b) confirms that the error mainly contaminates the smallest scales, while the energy-containing largescales are much less affected. This observation supports the interpretation of quantization as adding thermal noise to the original data. Note that we chose ε = 10 −2 to magnify the numerical error. For ε = 10 −6 or less, the reconstructed spectrum would be visually identical with the original. Fi- nally, figure 6(c) confirms that asymptotic scaling (7) is realized as soon as e ∞ < 10 −3 .
Let us now consider how the compression ratio r scales with the size of the dataset. There are at least two obvious ways to obtain a smaller HIT dataset from the original 512 3 arrays: down-sampling and sub-domain extraction. The former means that only every 2nd (or 4th, etc.) grid point in each direction is retained, all other data points are discarded. The latter means that only the first 256 (or 128, etc) points in each direction are retained. We have thus constructed two sequences of datasets of size n 3 = 32 3 , 64 3 , 128 3 , 256 3 and 512 3 , where the largest dataset is the original one. The compression ratio using tolerance ε = 10 −6 is displayed in figures 7(a) and (b). Figure 7(a) shows r of the down-sampled data. Starting from the rightmost point, there is a major decrease in r between n = 512 and 256, i.e., after discarding every second point. It is followed by a slower decay that accompanies subsequent down-sampling. The original dataset is a result of direct numerical simulation of turbulence, which implies that the inertial range is fully resolved, i.e., the distance between the neighboring grid points is smaller than the Kolmogorov scale. Consequently, the turbulent velocity fluctuation at the smallest scale contains disproportionately less energy than any larger scale in the inertial range. This means that the numerical values of the smallest-scale wavelet coefficients contain much less non-zero significant digits. Therefore, they are compressed much more efficiently. In the inertial range, the number of non-zero significant digits becomes larger as the wavenumber decreases, which explains further gradual decrease of r with decreasing n. The evolution of r with n is thus related to the decay of the Fourier energy spectrum. Figure 7(b) shows r of the sub-domain data. In this case, the energy per unit volume of the smallest-scale velocity fluctuation field does not depend on n. Hence, smallest-scale wavelet coefficients are of the same order of magnitude, regardless of n. As a result, in this case, the compression ratio r varies less with n than in the previous case. It is practically constant, r = 5.5, between n = 128 and 512. There is, nevertheless, some moderate decrease down to r = 4.5 at n = 32. It can be explained by the efficiency of the entropy coding becoming lower as the dataset becomes smaller.
These scalings confirm that the chosen method of compression is particularly suitable for datasets produced by large-scale, high-resolution numerical simulations. In addition, the largest compression ratio will be achieved if the data are stored in one single file rather than divided in multiple subdomain files treated independently. Further, Appendix A investigates into the effects of spatial inhomogeneity by considering fluid flow past a solid cylinder.

Seismology simulation
Let us proceed with an example from seismology. Similarly to the analysis in the previous section, we compress an example dataset with a given ε, measure the compressed file size, then reconstruct the fields from the compressed format and measure the L ∞ error e ∞ . The example dataset is a restart file for a synthetic seismogram computation of a large earthquake using SPECFEM3D, a spectral-element code based on a realistic fully threedimensional Earth model [8]. In the simulation, the domain is split in as many slices as the number of processors used. Data that correspond to different slices are stored in separate files, and, in this example, we only process one of them. The file contains 31 single-precision arrays, but only the first 9 are large: the displacement, velocity and acceleration of the crust and mantle (3 × 19904633 elements each), of the inner core (3 × 1270129 elements each) and of the outer core (2577633 elements each). Only these arrays are compressed, while the remaining 22 small arrays containing in total 5250 single-precision numbers are directly copied in the end of the compressed file. Note that, although the grid is three-dimensional, the spectral element solver packs the variables in one-dimensional arrays. This packing preserves

Global-scale simulation of tropical cyclones
In this section, we evaluate the compression of restart files produced in a global weather simulation using the Multi-Scale Simulator for the Geoenvironment (MSSG), which is a coupled non-hydrostatic atmosphere-ocean-land model developed at the Center for Earth Information Science and Technology, Japan Agency for Marine-Earth Science and Technology (JAMSTEC) [47]. Its atmospheric component includes a Large-Eddy Simulation (LES) model for the turbulent atmospheric boundary layer and a six-category bulk cloud micro-physics model [51]. Longwave and shortwave radiation transfer is taken into account using the Model simulation radiation transfer code version 10 (MstranX) [52]. In the global weather simulation mode, MSSG uses a Yin-Yang grid system in order to relax Courant-Friedrichs-Lewy condition on the sphere. Higher-order space and time discretization schemes are employed.
In [6], nine tropical cyclones were simulated within the framework of the Global 7km mesh nonhydrostatic Model Intercomparison Project for improving TYphoon forecast (TYMIP-G7). The main objective of that project was to understand and statistically quantify the advantages of high-resolution global atmospheric models towards the improvement of TC track and intensity forecasts. Thus, in MSSG, each horizontal computational domain covered 4056 × 1352 grids in the Yin-Yang latitude-longitude grid system. The average horizontal grid spacing was 7 km. The vertical level comprised 55 vertical layers with a top height of 40 km and the lowermost vertical layer at 75 m. It was recognized that, when requiring the output data for every 1 or 3 h over 5-day periods be stored for analyses, the total volume of storage summed up to a huge amount.
Here, we use the initial data of July 29, 2014 at 12:00 UTC. The dataset is stored in multiple files. They include a text header file that contains descriptive parameters of the dataset such as its size, hydrodynamic field identifiers, domain decomposition parameters, etc. The hydrodynamic field variables are stored in 1024 binary files, each containing all n f = 12 fields within the same sub-domain. Thus, each sub-domain contains n f × 254 × 43 × 55 grid point data in double precision, totaling to 55 MB of data in one file. First 512 of these files belong to the Yin grid and the remaining 512 to the Yang grid.
The compression can either be performed on each data subset file independently or, alternatively, the subsets can be merged before applying the wavelet transform. It is noticed in Section 4.1 that the compression ratio has a tendency to increase with the data size. Hence, all subsets of each hydrodynamic field are concatenated as parts of a three-dimensional array of size 4064 × 2752 × 55. The fields are processed sequentially requiring 4.6 GB of RAM for the input array plus up to 6.4 GB for the encoded output data and for temporary arrays.
First, let us quantify the compression performance of this restart dataset. Figure 9(a) shows that the relative error e ∞ varies almost identically to the threshold ε, where e ∞ is calculated as the maximum relative error over all data points of all fields. It saturates at the level of 10 −14 due to round-off. The compressed file size, shown in figure 9(b), is significantly smaller than in the previously considered cases of turbulent incompressible velocity data. The linear fit Σ T = (−9.5 − 2.7 log 10 e ∞ ) × 100%, shown with a green dotted line, has a greater offset from the dash-dot diagonal and a less steep slope compared with the HIT fit (7), which is shown with a magenta dotted line. The global weather simulation is more complex than the incompressible Navier-Stokes, the fields contained in the restart files are heterogeneous and have sparser wavelet transform that compresses more efficiently.
To measure the effect of lossy restart data compression on the simulation accuracy, the following protocol was implemented.
• Compress the original restart data with some given tolerance ε; • Using the compressed file, reconstruct the full-size restart data; • Restart the weather simulation and let it continue for 120 hours of physical time; • Compare the time evolution of selected physical quantities with the original simulation not using data compression.
We focus on typhoon Halong. Figures 10(a), (b) and (c) show the typhoon core trajectory, minimum pressure in the core and the wind speed, respec- tively. The best track from observation is shown using the black color, the result of the original simulation is shown using the red color. The typhoon trajectory is predicted well during the first three days, after that it deviates more to the north during the simulation. The predicted pressure drop and the increase of wind speed are slightly advanced in time compared with the observation data, but the maximum wind speed is evaluated accurately. The results of the restarts with ε = 10 −16 , 10 −6 and 10 −4 are shown with different colors. All of them visually coincide with the original result during the first 24 hours, after which the discrepancy grows large enough to be visible, but it remains in all cases significantly smaller than the difference with respect to the observation.
For ε = 10 −3 or larger, it was found impossible to restart the simulation because of numerical instability. In fact, the onset of such instability is already noticeable in the case of ε = 10 −4 , as the wind speed becomes slightly different in the very beginning of that simulation. We investigate further on this effect by plotting the difference between the restart and the original results on a logarithmic scale. We consider the L 2 error norm of the wind speed, obtained by summation over all grid points in latitude-longitude square window Ω of size 10 • × 10 • centered on the typhoon as predicted in the original simulation, where U restart and U original is the velocity magnitude in the restart simulation and in the original simulation, respectively. For all simulations with ε ≤ 10 −5 , the error increases polynomially as the power ≈ 2 of the physical time after the restart point, until saturation after about 72 hours. This trend arises from the nonlinear dynamics of the system and it shows no distinguishable deterministic relation with ε as long as the latter is sufficiently small. For ε = 10 −4 , the error increases rapidly during the first time iterations, but then it decreases and ultimately follows the same trend as described previously. It is apparent that larger values of ε entail faster initial error growth and, ultimately, numerical divergence that cannot be accommodated by the physical model. From figure 9 we notice that successfull restarts belong to the intermediate regime (where Σ evolves according to equation (9)) and to the high-accuracy tail of the compression diagram, such that the compressed file size is greater than 2.5, i.e., the compression ratio is less than 40. The low-accuracy end of the diagram showing the file size of less than 1% can be practical for the data archiving only if it is not intended as input for restarting the simulation. Taking into consideration these opposing requirements of simulation reliability and data storage efficiency, it is advisable to compress the restart data of global weather simulations with the relative tolerance of ε = 10 −6 .

Urban-scale simulation
In [40], a tree-crown resolving large-eddy simulation coupled with a threedimensional radiative transfer model was applied to an urban area around the Tokyo Bay. Source terms that represent contributions of the ground surface, buildings, tree crowns, and anthropogenic heat, were integrated within the MSSG model in order to perform urban-scale simulations. In particular, tree crowns were taken into account using the volumetric radiosity method. The landscape was set based on geographic information system (GIS) data from the Tokyo Metropolitan Government. The initial and side-boundary atmospheric conditions were imposed by the linear interpolation of the mesoscale data provided by the Japan Meteorological Agency. The computational domain was a rectangular box discretized with uniform grid step (5 m) in two perpendicular horizontal directions, and slightly stretched in the vertical direction (from 5 m near the sea level to 15 m in the upper layers). We consider restart data for a simulation using N = 2500 × 2800 × 151 grid points. Only the hydrodynamic fields are compressed since all other restart data use much less disk space. The domain is decomposed in equal Cartesian blocks of 50 × 25 × 151 points. These data sets are stored in 5600 files, each containing 21 hydrodynamic fields. The files occupy 166 GB of disk space in total.
As shown in figure 12, data compression dramatically reduces the storage requirement. We have compared two approaches. The first ("united file") is to read the data from all sub-domains and concatenate in one array per field, then transform and encode each field, and write all encoded data in one binary file. This is the same procedure as used in Section 4.3. The second approach ("divided file") consists in processing each sub-domain independently, producing 5600 compressed binary files. Since it does not require any communication between sub-domains, processing can be executed in parallel with ideal speedup. The parallel speedup comes at a price of larger compressed file size, by ≈ 2% of the original data size. The sub-domain files are smaller than the united file, therefore, their compression ratio is overall lower, as explained in Section 4.1. In addition, part of the difference is due to the subdomain data being normalized with the respective local maximum absolute value instead of using the global maximum. The difference may be insignificant when considering the high-accuracy end of the plot, e.g., for ε = 10 −14 , Σ increases from 21 to 23%. However, when the tolerance is set to ε = 10 −6 , the variation of the compressed file size from 6 to 8% of the original restart file size can be considered as relatively large. Using MPI communication for parallel wavelet transform and range coding, it may be possible to achieve better trade-off between parallel speedup and compression ratio.
We follow similar procedure as in Section 4.3 to evaluate the effect of lossy compression upon restart. Taking restart files from a previous simulation as initial data, three new simulations have been performed for the period of 16:00-16:10 JST (Japan Standard Time) on August 11, 2007. The first of them resumes from the original files, while the second and the third resume from compressed data with ε = 10 −6 and 10 −12 , respectively. We use the united compressed file format that introduces no artifacts at the boundaries between subdomains. In the following discussion, we analyze the output data written on disk in the end of these simulations. Table 1 shows the residual relative error in the L ∞ and L 2 norms, respec-tively calculated as where f denotes a hydrodynamic field obtained in the reference simulation that resumed from the original restart data, andf (ε) stands for the respective field computed starting from the compressed data. To simplify the notation, f andf (ε) are treated as one-dimensional arrays.
The relative L ∞ error is of order 100% in both cases for most of the field variables, except for pressure fluctuation that reaches 14% and for the base density and pressure, both of which are constant in time and therefore remain of the same order as ε or less. The relative L 2 error is, in general, two orders of magnitude smaller than the respective L ∞ error, which means that, in most part of the domain, the pointwise residual error is much smaller than the respective peak values.
Comparing the present results with the global numerical simulation described in Section 4.3, one of the key differences is in the spatial resolution. In this building-resolving simulation, the eddy turnover time of the smallest wake vortices is of order of seconds. Therefore, after 10 minutes (i.e., by the end of the simulation), the small structures de-correlate, producing large pointwise error. Field e ∞ (ε = 10 −6 ) e 2 (ε = 10 −6 ) e ∞ (ε = 10 −12 ) e 2 (ε = 10 −12 ) fl 1 To gain better insight, let us consider a horizontal plane at 20 m altitude above the sea level. Figure 13 shows However, a careful examination reveals significant differences on a smaller scale, compare between panels (f ) and (c). To focus on such small-scale discrepancy, we calculate the difference between the velocity fields obtained with and without compression, |∆U| = |U(ε = 10 −6 )−U(ε = 0)|, and display it in Fig. 13(g), (h) and (i ) on an exaggerated color scale. The darker tone of panels (g) and (h) suggests that |∆U| is generally much smaller than U.
There are, however, many bright spots around the buildings that mark the small-scale differences in the wake. A zoom on one of these spots displayed in Fig. 13(i ) reveals that, locally, |∆U| is of the same order of magnitude as U, in agreement with the global L ∞ error evaluations shown in table 1.
In addition, the error is spatially organized in a pattern characteristic of mixing layers. The vorticity plots in Fig. 14(a), (b) and (c) show that ω z is small in the bulk of the fast current (lower-bottom corner of the subdomain), but many strong small-scale vortices are present in the mixing layer as well as in the slow current around the buildings. Although these small vortices show qualitatively similar arrangement in Fig. 14(a) as in Fig. 14(b), the exact position differs by as much as the core size. Consequently, the error |∆ω z | = |ω z (ε = 10 −6 ) − ω z (ε = 0)| is a superposition of strong well-localized peaks. It is worth mentioning that a restart with ε = 10 −12 has led to very similar results. This is expected as small-scale vortices shed from the buildings evolve rapidly and chaotically. Exact deterministic repetition of such simulation requires that the initial data be exact. On the other hand, the initial error has negligible effect on a kilometer scale.

Compression performance with consideration of computational cost
Our intention is to compress data from numerical simulations. Let us first consider a scenario when the same computer is used for the simulation and for the data compression. Writing the output data in a divided file, as explained in Section 4.4, enables parallel execution of the compression program, albeit at a cost of slight increase in the compressed file size. Simulations of-  ten use time-marching schemes, each time iteration incorporates differential operators of linear computational complexity (i.e., proportional to the number of grid points N) in the case of, e.g., evaluating derivatives using finite differences, or having an even higher complexity (e.g., N log N or N 2 ) if spectral methods are used or linear systems are to be solved. Although wavelet transform and range coding are known to be relatively expensive operations, their computational complexity is linear. This means that the number of arithmetic operations necessary to compress one snapshot of the output data is, at worst, proportional to the number of arithmetic operations required for one time iteration in the simulation. The proportionality constant may be greater than one if the simulation only uses explicit low order schemes and the right-hand side of the evolution equation is simple enough. However, typical practical problems in scientific computing are computationally more intense than compression using one pass of wavelet transform followed by quantization and range coding. Besides that, it is rare to write on disk the output after every time step. Temporal sampling is commonly used [38], which dramatically reduces the data compression cost in comparison with the simulation cost. A different scenario would be to perform a simulation on a supercomputer and compress/decompress the result files on a desktop computer. The elapsed time of data compression can be long in such situation, and that is the case we focus on in this section. For the performance analysis, we use

Performance assessment using a roof-line model
We used Intel Advisor 2019 Update 4 (build 594873) for the performance analysis. WaveRange was built using gcc 7.4.0 with -O2 level optimization and AVX2 support, under the Ubuntu 18.04.1 operating system. The xvelocity component of the HIT dataset was read from a file in the FluSI HDF5 format, compressed with ǫ = 10 −6 and written on disk also using HDF5. Then it was decompressed. The compression and the decompression programs have been profiled separately. Therefore, the results are presented in two columns in table 2 and in two panels in figure 15.
In the compression procedure, the wavelet transform is the least time consuming step. This can be explained by the relatively large fraction of vector operations in it: there are 12 vector loops and 12 scalar kernels shown in figure 15 with the red and the blue circles, respectively. Two of them achieve the L1 bandwidth bound. The need for strided memory access is the main factor that limits further optimization of the transform (these loops are marked with blue circles situated below the DRAM bandwidth line). In contrast, only 1 of the 8 range coder kernels has been vectorized, and the most time-consuming one is the function that encodes a symbol using frequencies. This explains why range coding is the most time consuming step. Quantization and other parts of the program count 2 vector loops and Circles show loops that belong to the wavelet transform routine, squares correspond to range encoding/decoding and diamonds belong to quantization/dequantization or the rest of the program. Blue markers show scalar loops and red markers show vectorized loops. The size of the marker and its line width signifies its relative contribution to the program total elapsed time: less than 1% for the smallest markers, between 1% and 15% for the medium-size markers and more than 15% for the large markers. Markers may overlap.
4 scalar kernels. In the quantization process, type conversion from double to char presents difficulties for the automatic optimization. The overall time in vectorized loops amounts to 11% of the total CPU time. Decompression with WaveRange takes longer time than compression. Although the inverse wavelet transform has 15 vector and 6 scalar kernels, its execution takes longer than the forward transform used in the compression. Range decoding is by far the most expensive part, as it amounts to 61% of the decompression CPU time. Only 1 of the 10 range coder kernels has been vectorized. The most time consuming function is the one that calculates cumulative frequency of the next symbol. Dequantization is relatively cheap: it only takes 13% of the total CPU time. The overall vector time ratio is 12%, which is nearly the same as in the compression program.

Comparison with other methods
Before comparing WaveRange with other methods, it is important to provide baselines in terms of lossless compression achievable with generalpurpose utilities. This is summarized in table 3. As in the previous section, here we use the x-velocity component of the HIT dataset.
DEFLATE [53] is an algorithm widely used for general purposes. The best compression, i.e., the smallest compressed file size, is achieved when the level of compression is set to 9. The execution time can be minimized by setting the level of compression to 1. It is customary to apply shuffling as a pre-conditioner to facilitate floating point data compression with DEFLATE. The idea of shuffling is to break apart floating point elements of an array into their mantissa and exponent components, then change the byte order in the data stream such as to place the first byte of every element in the first chunk, then the second byte of every element in the second chunk, etc. In the scientific datasets, values at neighboring points are usually close to each other. Therefore, shuffling produces a data stream that includes many continuous sub-sequences of identical entries, which DEFLATE can compress well. From this point of view, shuffling is similar to the quantization described in Section 2.2.
The last two lines in table 3 show the performance of Szip and 7-Zip. Szip [54] is an implementation of the extended-Rice lossless compression algorithm designed for use with scientific data. In our test, we activated the optional nearest neighbor coding method. 7-Zip applies the LZMA method [55], and we set the level of compression to 9, which is the maximum level. The crosses in figure 6(b) correspond to the compressed file size obtained with this method. All algorithms have been applied as HDF5 filters (HDF5 version 1.10.0-patch1), with the exception of 7-Zip, which was used as a standalone application (version 16.02).
The performance metrics are the relative compressed file size Σ as defined by (6), compression ratio r as defined by (8), compression throughput θ c = s(f )/t c and decompression throughput θ d = s(f )/t d , where s(f ) = 1 GB is the storage space required for a 512 3 array of double-precision values, t c is the CPU time for compression and t d is the CPU time for decompression. DEFLATE with the lowest level of compression reduced the file size by only a very small amount, down to 96.3% of the original size. Switching to the maximum level of compression helps to gain additional 0.7%, but shuffling the data brings a dramatic improvement, allowing to reach 81.2%, which is the best result among all lossless methods considered here. Szip is slightly less efficient from the file size reduction viewpoint, but its compression throughput is higher. However, the decompression throughput is lower for Szip than for DEFLATE. 7-Zip turns out to be less efficient by all metrics, which is not surprising, since it is not designed for use with scientific data. Overall, the compression ratio is 1.231 at most, meaning that many-fold data reduction cannot be achieved without loss of information in this example. The performance of different lossy compression methods, including Wav-eRange, is displayed in figure 16, which shows Σ, r, θ c and θ d as functions of the L ∞ error norm. Three other lossy methods have been evaluated. Scale-Offset is an HDF5 filter that performs a scale and offset operation, then truncates the result to a minimum number of bits. Scaling is performed by multiplication of each data value by 10 to the power of a given scale factor, which can be adjusted in order to reach the desired compressed file size or accuracy. SZ-1.4 [56] and ZFP [57] are two state-of-the-art scientific data compressors. The former is based on the Lorenzo predictor and the latter uses a custom transform that operates multi-dimensional blocks of small size. Both are implemented as HDF5 filters and both can operate in a fixed error mode. To obtain the plots in figure 16, we followed the same procedure as described in the previous sections: the HIT x-velocity data file was compressed, the file size was measured, then it was decompressed and the L ∞ error was measured. In addition, the execution time was measured for the compression and for the decompression separately.
Let us first discuss the compression performance in terms of Σ and r. The scale-offset compression generally produces larger files than WaveRange. Note that this method is equivalent to quantization (3) in the limit of large L ∞ error. ZPF produces slightly larger files than WaveRange, except when the set tolerance is smaller than 10 −14 and the L ∞ error of WaveRange in- creases significantly due to roundoff. The performance of SZ-1.4 varies largely depending on the control parameter configuration. In our test, we optimized it for the maximum compression ratio at a given error magnitude. Thus, the maximum quantization interval number was increased as the error decreased.
With this setting, SZ-1.4 could produce smaller files than WaveRange, but only for the relative error greater than 10 −8 . SZ-1.4 switched to lossless mode when the relative error smaller than 10 −10 was requested.
Considering the throughput θ c and θ d , ZFP was generally the fastest in our tests, although it was outperformed by SZ-1.4 in certain cases by a small amount. WaveRange was 3 times slower than ZFP for the compression and 5 times slower for the decompression. This is not surprising, considering that the transform used in ZFP was optimized for the maximum throughput at the cost of a certain decrease in the compression ratio. Interestingly, despite the relatively high algorithmic complexity, WaveRange compression was faster then the scale-offset compression.

Conclusions
A wavelet-based method for compression of data output from numerical simulation of fluid flows using block Cartesian grids has been presented. The method consists of a discrete wavelet transform, quantization adapted for floating-point data, and entropy coding. It is designed such as to guarantee the desired pointwise reconstruction accuracy. An open-source software implementation has been provided, see https://github.com/pseudospectators/WaveRange.
The data compression properties have been analyzed using example numerical tests from different kinds of problems, from idealized fluid flows to realistic seismology and weather simulations. In particular, it is found that, in the most challenging (from the compression point of view) case of homogeneous isotropic turbulence, compression allows to reduce the data storage by a factor of 3 using ε = 10 −8 , which is significantly better than storing the same data in single-precision floating-point format. The method show favorable scaling with the data size, i.e., greater compression ratios are achieved for larger datasets. Compression of the wake turbulence is also slightly better, compared to the reference homogeneous isotropic turbulence case. This is explained by greater inhomogeneity of the wake turbulence, which means that there are fewer large wavelet coefficients.
The compression performance depends on the flow type. For the realistic data generated from global and urban weather simulations, the file size can be reduced by a factor of about 15 using the threshold value ε = 10 −6 . Both simulations can be successfully restarted from the reconstructed data. The reconstruction error has shown no significant effect on the dynamics of largescale structures, which are typically the main objects of interest. It should be noted, however, that the small-scale structures may randomize very quickly if any small initial error is introduced in the simulation.

Data availability
The HIT and the wake turbulence datasets, in the compressed format, are available at https://osf.io/pz4n8/. In addition, the same HIT dataset is contained in the supplementary file data_sample.zip. Access to the weather simulation data can be granted upon request, under a collaborative framework between JAMSTEC and related institutes or universities. The computational domain is discretized using a uniform Cartesian grid consisting of 960 × 768 × 384 points. The no-slip boundary condition at the surface of the cylinder is modeled using the volume penalization method with the penalization parameter C η = 5 · 10 −4 . For more information about the numerical method, see [46].
A small cylindrical detail attached on the surface of the cylinder at the angular distance of 135 • from the front stagnation point served to quickly break the bilateral symmetry of the flow. Subsequently, random noise introduced during the startup phase provoked three-dimensional instabilities. The three components of the velocity field (u,v,w) at time t = 330 were saved, respectively, in three separate files in double precision. Each file occupied 2.2 GB of hard disk space.
The flow configuration and the result of the simulation are visualized in Fig. A.17. The wake is apparently turbulent with a variety of scales and heterogeneity reminiscent of industrial flows. Grey color shows the cylinder, cyan shows an iso-surface of the vorticity magnitude calculated using the original velocity field, and magenta shows a similar iso-surface for ε = 10 −2 . The two iso-surfaces overlap.
Similarly to the previous HIT test case, the procedure of compression with prescribed tolerance ε and subsequent reconstruction has been applied to the velocity components of the turbulent wake. Again, the relative error measured in L ∞ norm has appeared almost identical to ε, except for the smallest and for the largest ε. Fig. A.18(a) shows the compressed file size as a function of the error norm, component-wise. The compressed file size is again normalized with the original file size. The error norm is defined by (5). The compression method is equally effective for all velocity components, despite the anisotropy of the velocity field. Perfect reconstruction cannot be reached because of round-off errors, but reconstruction with 10 −14 accuracy is achieved from a compressed file slightly larger than one half of the original size. It can also be noticed that, if the velocity field were stored in a single precision file, the error would be 100,000 times larger than when storing the same field using the compressed format in an equally large file. Similarly to Fig. 6(b) for the HIT test case, the diagonal dash-dotted line in Fig. A.18(a) shows the gain in compression achieved by discarding the least significant digits, i.e., by reducing the precision of each point value. The difference between this upper bound and the actually measured file size is the combined effect of wavelet transform and entropy coding.
The compression ratio as a function of the error norm is shown in Fig. A.18(b). Greater compression ratios can be achieved when precision requirements are less stringent. For instance, in this example one can achieve 8 times reduction in the volume of data if stored with 10 −6 accuracy. For very low accuracy, the compression ratio saturates at r ≈ 400, which is close to the maximum compression ratio obtained for the HIT data.
In the intermediate range of ε, the compressed file size as a function of the relative L ∞ error can be approximated as which is shown in Fig. A.18(a) using green dots. It can be compared with Σ HIT given by (7), which is superposed on the same figure using magenta dots. The values of Σ Cyl are smaller than those of Σ HIT . This can be explained by comparing the histograms displayed in Fig. A.19. Only the first component, u, is shown for clarity. The results for v and w are similar.
To guarantee fair comparison between two histograms for the two different flow fields, u is normalized by its half-span before applying the wavelet transform yieldingũ. The same normalization is used in the data compression algorithm. As the number of bits required to represent a point value of  (7) and (A.1) that correspond to Σ HIT and Σ Cyl , respectively, are shown with the magenta and green dotted lines. The dashed vertical line shows the accuracy of single-precision storage and the dash-dot lines correspond to compression using quantization only. u, after quantization, is proportional to log 10 |ũ|, the latter quantity is used to produce the histogram. The interval between its minimum and maximum is divided in a finite number of bins and the number of point values falling in each bin is counted. Note that the maximum values are almost identical for both datasets. The result is normalized such that the area under the curve integrates to 1. By comparing the histograms for the HIT and the wake velocity datasets, one can see, for example, that the HIT field has relatively many coefficients of order of magnitude 10 −2 , but less at 10 −6 . The expected value for the HIT case is −4, whereas in the cylinder wake case it is equal to −5.1. The standard deviation is similar in both cases: 1.3 and 1.1, respectively. It follows that the HIT wavelet coefficients are, on average, almost one order of magnitude larger than the cylinder wake wavelet coefficients. Consequently, for equal compression ratio, the L ∞ error is expected to be one order of magnitude larger for the HIT data than for the cylinder wake. This is in agreement with the observed difference between the linear fits in Fig. A.18(a). In addition, the slightly larger skewness of the cylinder wake PDF explains why the difference becomes slightly smaller when the tolerance is decreased -also compare the slopes of (7) and (A.1).
WaveRange treats individual components of a vector field independently. Although it must be possible to exploit correlation between multiple scalar fields, this procedure is not straightforward. We have tested two approaches. The results are compared in table A.4. The 'Velocity -polar' method consists in transforming the velocity components u, v and w to a magnitude ρ and  The scalar fields ρ, ϑ and φ are then compressed with a prescribed tolerance ǫ.
The 'Vorticity -Cartesian' method calculates the vorticity Ω = ∇× U , which is the curl of the velocity vector U = (u, v, w), and the spatial average of the velocity U 0 . Then, the three components of the vorticity are compressed with tolerance ǫ. The velocity is reconstructed from Ω and U 0 using the Biot-Savart formula. All differential and integral operators are approximated using a Fourier spectral method. Both methods introduce error due to additional computations. For the 'Velocity -polar' method, computation entails a larger round-off error than in the original 'Velocity -Cartesian' case. The 'Vorticity -Cartesian' method involves numerical differentiation which has a truncation error. For these reasons, we select relatively large compression tolerance ǫ in order to achieve the reconstruction error of approximately 10 −4 . It is much larger than the truncation and the round-off errors.
To quantify the reconstruction accuracy of a vector field using a single scalar-valued metric, the maximum relative L ∞ error norm is selected among the three velocity components, The compressed file size Σ vec is calculated as the sum of the compressed file sizes divided by the original storage size of the three velocity components in double precision, and the compression ratio is equal to r vec = 1/Σ vec . Note that, in all methods, e vec is calculated on the reconstructed Cartesian velocity componentsǔ,v,w, but the tolerance ǫ is set on the transformed field in the 'Velocity -polar' and 'Vorticity -Cartesian' methods. Therefore, in the two latter cases, ǫ is iteratively varied until e vec becomes close to 10 −4 .
The final values of e vec are shown in the last column of table A.4, and the corresponding ǫ is in the second last column. From the values of Σ vec and r vec in, respectively, the second and the third columns in table A.4 we conclude that the original 'Velocity -Cartesian' method provides the best compression for the desired 10 −4 accuracy.
Such inefficiency of the 'Velocity -polar' and 'Vorticity -Cartesian' representations can be explained by the loss of uniformity in the spatial distribution of pointwise errors, and unequal errors ofǔ,v andw. This implies that some point values are stored with higher precision than necessary. In addition, the 'Velocity -polar' representation suffers from the discontinuity in ϑ and φ artificially introducing small scales in the field, which require more wavelet coefficients to be stored. To find a suitable vector field transform may be a promising direction for the future work.

Appendix B. Reliability of the error control
The wavelet transform offers control over the L 2 norm, but we aim for the L ∞ error control. The poinwise error in the physical space is a linear combination of the quantization errors of the wavelet coefficients within the stencil. The latter are all smaller than ε F by construction. The weights are constants specific to the CFD9/7 wavelet. The number of terms in the sum is bounded by the number of operations in the lifting steps times the number of spatial dimensions (= 3) times the number of levels of the transform (= 4). From these considerations, one may derive a theoretical upper bound on the ratio between the L ∞ error in physical space and the filtering threshold ε F , and thus obtain a theoretical estimate for the parameter η in (4). However, we do not attempt such rigorous analysis. The value η = 1.75 is an empirical constant. Moreover, it does not exactly guarantee that e ∞ = ε.
In order to gain quantitative information about reliability of the error control in WaveRange across different application scenarios, we compiled the L ∞ error data e ∞ from all examples (HIT, wake turbulence, typhoon, urban-scale and seismology simulations) with different tolerance values ε into one dataset. The error-to-tolerance ratio e ∞ /ε was then calculated for each sample in the dataset. For the HIT and the wake turbulence, e ∞ was taken separately for each velocity component. For the typhoon, urban-scale and seismology simulations, the maximum e ∞ over all fields was taken. For the urban-scale simulation, the united and the divided storage schemes were both included in the analysis. This yielded a set of 186 values of e ∞ /ε. We then discarded those samples that corresponded to ε ≤ 10 −14 , for which the error control failed because of the round-off errors. The remaining set contained 138 samples. We then calculated a histogram of e ∞ /ε. The result is shown in figure B.20. The peak of the histogram is at e ∞ /ε = 1. The expected value of e ∞ /ε is estimated as 0.93, and the standard deviation is equal to 0.20. By numerical integration of P DF (e ∞ /ε) we found that, in 83% of the cases included in the analysis, e ∞ /ε was less or equal 1. In 94% cases it was less or equal 1.2 and in no case it exceeded 1.7.