BOUNCE: memory‑efficient SIMD approach for lightweight integer compression

Integer compression plays an important role in columnar database systems to reduce the main memory footprint as well as to speed up query processing. To keep the additional computational effort of (de)compression as low as possible, the powerful Single Instruction Multiple Data (SIMD) extensions of modern CPUs are heavily applied. While a scalar compression algorithm usually compresses a block of N consecutive integers, the state-of-the-art SIMDified implementation scales the block size to k ⋅ N with k as the number of elements that can be processed simultaneously in an SIMD register. On the one hand, this scaling SIMD approach improves the performance of (de)compression. On the other hand, it can lead to a degradation of the memory footprint of the compressed data. In this article, we analyze this degradation effect for various integer compression algorithms and present a novel SIMD concept to overcome it. The core idea of our novel SIMD concept called BOUNCE is to concurrently compress k different blocks of size N within SIMD registers, guaranteeing the same compression ratio as the scalar variant. As we are going to show, our proposed SIMD idea works well on various Intel CPUs and may offer a new generalized SIMD concept to optimize further algorithms.


Introduction
Cloud computing has become mainstream and we observe that modern data analysis as well as data management continues to move to cloud environments. The biggest challenge for cloud providers of data management solutions is to make efficient use of the underlying hardware to reduce the overall costs. This holds especially for in-memory database systems in the cloud, because main memory is a driving factor for hardware costs [1]. Thus, the optimization of main memory consumption for base data as well as intermediate data during query processing plays an important role in these systems [2-4]. Generally, this aspect has been an integral part of in-memory database systems from the very beginning. For example, in-memory column-stores encode every base column as a sequence of integer values, and the memory space necessary for storing these integer sequences is reduced with the help of additional lightweight computations for integer compression. Various works have shown that this drastically reduces the main memory footprint [2,3]. Moreover, these compressed integer values also offer advantages for query processing [2,3,5]. However, compression as well as decompression causes additional computational effort.
To keep the computational effort as low as possible, the Single Instruction Multiple Data (SIMD) extensions of modern CPUs are heavily applied [6-8]. The SIMD objective is to increase the single-thread performance by executing an identical operation on multiple data elements in an SIMD register simultaneously (data parallelism) [9]. The SIMD extensions consist of two main building blocks [9]: (i) SIMD registers, which are larger than traditional scalar registers, and (ii) SIMD instructions working on those SIMD registers. SIMD instruction sets usually include arithmetic as well as Boolean operators, logical and arithmetic shifts, and data type conversions, including specific SIMD instructions to load data from main memory into SIMD registers and to write it back. In the past years, hardware vendors have regularly introduced new SIMD extensions with wider SIMD registers. For instance, Intel's Advanced Vector Extensions (AVX) operate on 256-bit SIMD registers, while Intel's newest extension set AVX-512 now uses 512-bit SIMD registers. The wider the SIMD registers, the more data elements can be stored and processed in an SIMD register. For instance, a 128-bit SIMD register of Intel's Streaming SIMD Extensions (SSE) can store four 32-bit data elements, AVX 256-bit SIMD registers can store eight (2×), and AVX-512 512-bit SIMD registers can hold 16 (4×) of such data elements. Consequently, the SIMD instructions operating on these wider SIMD registers can also process 2× respectively 4× the number of data elements in one instruction, which promises significant speedups.
From a lightweight integer compression perspective, the state-of-the-art utilization of these SIMD extensions works as follows [8]: While a scalar compression algorithm compresses a block of N consecutive integers, the state-of-the-art SIMD approach scales this block size to k ⋅ N with k as the number of integers that can be simultaneously processed with an SIMD register. As shown in various papers [2,6-8,10], this scaling approach increases the performance of compression as well as decompression routines. However, this scaling approach can lead to a degradation of the compression ratio compared to the scalar variant. In particular, the degradation effect grows with increasing SIMD register sizes, making the reduction in memory space sub-optimal. In [11], we analyzed this degradation effect for a heavily used and well-performing representative integer compression algorithm called BitPacking (BP) [7] and proposed an idea for an alternative SIMD concept called BOUNCE to overcome that effect.

Our contribution and outline
In this extended article of [11], we briefly recap this degradation effect for BP and extend this analysis to additional integer compression algorithms for generalization purposes. Afterwards, we present our alternative SIMD concept called BOUNCE for a memory-efficient SIMD approach to lightweight integer compression algorithms in more detail. The core idea behind BOUNCE is to concurrently compress k different blocks of size N within SIMD registers to achieve the same compression ratios as the scalar variant in all cases. To present our memory-efficient BOUNCE concept, the rest of the paper is structured as follows:

1. In Sect. 2, we summarize the state-of-the-art SIMD approach for lightweight integer compression algorithms and theoretically analyze the degradation effect for representative examples.

2. Then, we present our alternative SIMD concept BOUNCE in Sect. 3. As we are going to show, the foundation of our alternative concept is an appropriate data access pattern enabling fine-grained parallel, partition-based SIMD implementations.

3. Afterwards, we discuss the application of BOUNCE to lightweight integer compression algorithms in Sect. 4.

4. In Sect. 5, we present representative evaluation results using different hardware platforms to show the broad applicability of our BOUNCE concept. In addition to evaluation results based on synthetic data, we also present results on real data using the publicBI benchmark [12].
Finally, we discuss related work in Sect. 6 and present a summary in Sect. 7.

Analyzing state-of-the-art
The objective of lossless, lightweight integer compression algorithms is to represent a sequence of finite integer values with as few bits as possible [6-8]. We call this number the bit width of a value. Over the past decades, a large corpus of different algorithms has evolved [2,6-8,10]. Generally, lightweight integer compression algorithms employ a subset of the following five fundamental techniques: frame-of-reference (FOR) [13,14], delta coding (DELTA) [7,15], dictionary compression (DICT) [2,14], run-length encoding (RLE) [2,15,16], and null suppression (NS) [2,7,15]. While FOR and DELTA depict each value as the difference to a specified reference value (FOR) or to its predecessor value (DELTA), respectively, DICT substitutes each value with a unique key from a dictionary. The goal of these logical techniques FOR, DELTA, and DICT is to convert the uncompressed data into a sequence of smaller integers that can then be compressed using the NS technique (physical technique). The idea of NS is the elimination of leading zeros in the bit representation of small integers. In contrast, RLE deals with runs, which are continuous sequences of occurrences of the same value. Each run is represented by its length and value in the compressed format.

Based on combinations of these underlying techniques, all lightweight algorithms have in common that they map single values or whole blocks of l consecutive values to a compressed representation of r bits and often apply a data-dependent distinction of cases c. Moreover, the compressed representation consists of control patterns and data snips [8]. Data snips represent the compressed integers in binary format, while control patterns (descriptors) store the auxiliary information needed to interpret the data snips. Thus, the effectiveness of a data compression process for a given algorithm and given input data concerning the memory footprint can be indicated by the compression factor

cf = compressed data size in bits / uncompressed data size in bits.    (1)

If only statistical parameters of the input data are known, it might still be possible to calculate the expected compression factor. For example, if we know the probability for the number of bits needed for the binary representation of a single value and if we assume that there is no inner data dependency in the input sequence, then we are able to determine the expected compression factor as the ratio of the average compressed block size in bits to the average uncompressed block size in bits. To illustrate that aspect in this article, we assume that 64-bit integer values of bit width bw occur with a probability p(bw), such that ∑_{bw=0}^{64} p(bw) = 1. Since we assume no inner data dependency in the input data, the following condition holds for two consecutive values: p(bw_2 | bw_1) = p(bw_2). Based on that, the average uncompressed block size in bits can be determined as the product sum of the probabilities p(c) of the different compression cases c and the uncompressed block size 64 ⋅ l(c):

avg. uncompressed block size = ∑_c p(c) ⋅ 64 ⋅ l(c).    (2)

The average compressed block size in bits can be determined as the product sum of the probabilities p(c) of the different compression cases c and the compressed block size r(c):

avg. compressed block size = ∑_c p(c) ⋅ r(c).    (3)

In the remainder of this section, we specify Eqs. 1, 2, and 3 for different well-established algorithms, where the case probabilities p(c) can be determined from the probabilities p(bw) of the different bit widths bw.

BitPacking
A heavily used and well-performing integer compression algorithm is BitPacking (BP) [7]. BP belongs to the class of null suppression algorithms by omitting leading zero bits [7]. This type of compression is, for example, the basis for efficiently executing scans [17,18].

Scalar version
The scalar version of BP for 64-bit integer values is called BP64 and works as follows: The input sequence of integer values is subdivided into blocks of 64 integers each. For each block, the minimal number of bits required for the largest element is determined. Then, all 64 integers in each block are stored in a data snip with the respective number of bits for each value. The used bit width is stored in a single 64-bit integer as control pattern. Other scalar compression algorithms operate in a similar way and Fig. 1 gives a schematic overview of this procedure with a block size of four.
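To make this procedure concrete, the following minimal C++ sketch compresses one block in the spirit of BP64. The function name, the exact output layout (control word first, then the data snip), and the packing loop are our own illustrative choices, not the original implementation.

#include <cstdint>
#include <vector>

// Minimal sketch of scalar BP64: compress one block of 64 values.
// The control pattern (the common bit width) is emitted first,
// followed by the data snip holding all 64 values with bw bits each.
void bp64_compress_block(const uint64_t* in, std::vector<uint64_t>& out) {
    uint64_t ored = 0;                          // the OR of all values has the
    for (int i = 0; i < 64; ++i) ored |= in[i]; // same highest set bit as the max
    uint64_t bw = 64 - __builtin_clzll(ored | 1); // bw = 1 for all-zero blocks
    out.push_back(bw);                          // control pattern

    uint64_t buf = 0;                           // current output word
    unsigned filled = 0;                        // bits already used in buf
    for (int i = 0; i < 64; ++i) {
        buf |= in[i] << filled;
        filled += bw;
        if (filled >= 64) {                     // word full: flush it
            out.push_back(buf);
            filled -= 64;
            buf = (filled > 0) ? in[i] >> (bw - filled) : 0;
        }
    }
    if (filled > 0) out.push_back(buf);         // flush the remainder
}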

State-of-the-art SIMD approach
The state-of-the-art SIMD approach for integer compression is characterized by (i) the block size being scaled by k, the SIMD register size in number of integer values, and (ii) the application of the compression on these larger blocks as depicted in Fig. 1. For example, a 256-bit SIMD register is able to store and process 4 64-bit integer values at once, so that the block size is scaled by k = 4. Based on that, the k-way scaled SIMD block contains 256 integer values and for these elements, the minimal number of bits required for the largest element is determined. Then, all 256 integers in each block are stored in a data snip with that many bits for each value and the used bit width is stored as common control pattern. The compressed values in the data snips are organized using a k-way vertical layout distributing N consecutive integers to k different groups [7]. The resulting variant SIMD-BP256 offers superior performance compared to other compression algorithms [7].

Analyzing memory footprint
A main drawback of the k-way scaling is that the compression factor mostly increases (i.e., worsens) with increasing SIMD register and block sizes. On the one hand, fewer control patterns need to be stored due to the larger block sizes. On the other hand, many integer values may be compressed with a larger bit width than necessary. To precisely analyze this effect, we derive the expected compression factor for different integer bit width distributions.
Our analysis framework works as follows: The scalar algorithm BP64 encodes blocks of 64 64-bit values with the least possible common bit width and stores this bit width as control pattern with 64 bits. The SIMD-based implementations with SIMD register size k encode 64 ⋅ k values with the same approach. Given is a data distribution for 64-bit integer values, characterized by the probabilities p(b) for the bit widths 0 ≤ b ≤ 64, and an SIMD register size k. Now, we can distinguish 65 cases corresponding to blocks of 64 ⋅ k values that are encoded with bit width 0 ≤ b ≤ 64. Each of these cases (i) occurs with a probability p′(b, k), which depends on the given data distribution and the SIMD register size k, and (ii) is characterized by a block compression factor cf′(b, k). The expected compression factor for a k-way SIMD-based implementation of BP64 can be calculated by

cf(k) = ∑_{b=0}^{64} p′(b, k) ⋅ cf′(b, k).

The block compression factor cf′(b, k) is given by

cf′(b, k) = (64 ⋅ k ⋅ b + 64) / (64 ⋅ k ⋅ 64),

and the block probability can be derived by the following consideration. The probability of the occurrence of a block of size 64 ⋅ k containing only zero values is p′(0, k) = p(0)^(64⋅k). The probability of a block encoded with one of the bit widths 0, …, b is (∑_{i=0}^{b} p(i))^(64⋅k). The probability for the occurrence of a block encoded with bit width b is the difference of this probability and all probabilities for the occurrences of blocks with a smaller bit width than b:

p′(b, k) = (∑_{i=0}^{b} p(i))^(64⋅k) − ∑_{i=0}^{b−1} p′(i, k).

For the compressed size ratio between a k-way SIMD implementation and the scalar implementation of BP, we calculate cf(k)/cf(1), with cf(1) corresponding to the scalar compression factor (scaling factor k = 1).
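For illustration, the following C++ sketch evaluates these formulas numerically; the bit-width distribution is passed as a vector p[0..64], and all names are ours.

#include <cmath>
#include <vector>

// Sketch: expected compression factor cf(k) of k-way scaled SIMD-BP64
// for a given bit-width distribution p[0..64] (must sum to 1).
double expected_cf(const std::vector<double>& p, int k) {
    const double n = 64.0 * k;       // number of values per scaled block
    double cum = 0.0;                // p(0) + ... + p(b)
    double sum_smaller = 0.0;        // sum of p'(i, k) for all i < b
    double cf = 0.0;
    for (int b = 0; b <= 64; ++b) {
        cum += p[b];
        double p_block = std::pow(cum, n) - sum_smaller;  // p'(b, k)
        sum_smaller += p_block;
        double cf_block = (n * b + 64.0) / (n * 64.0);    // cf'(b, k)
        cf += p_block * cf_block;
    }
    return cf;  // compressed size ratio vs. scalar: expected_cf(p,k) / expected_cf(p,1)
}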
In the following, we apply these formulas to two different data distributions where most integer values are characterized by a bit width of 2, but there is also a probability x for integer values with a larger bit width. While in the first case the larger bit width is 3, it is 60 in the second case. Figure 2a and b depict the compressed size ratio cf(k)/cf(1) for k = {2, 4, 8, 16, 32} for the SIMD implementations for both cases and different probabilities. As we can observe in Fig. 2a, all lines are below 1 for case one with the larger bit width of 3. That means, each SIMD implementation (using different k-way scalings) has a lower compression factor than the scalar algorithm. Thus, the memory footprint is further optimized compared to the scalar variant. The reason is the lower number of control patterns for larger blocks and the more or less homogeneous bit widths of all integer values. Moreover, the value of k yielding the best compression ratio depends on the probability of the larger bit width.
In contrast to that, for the second case with the larger bit width of 60 and low probabilities for integer values with that larger bit width, we see that all lines are above 1 (cf. Fig. 2b). This means, the compression ratio of the scalar variant is much better than that of the SIMD implementations. For example, the compressed representation of the 8-way SIMD implementation (SIMD register size 512 bits with 64-bit integer values) is 4 times larger than for the scalar variant. These disturbing effects destroy the advantages of SIMD-based integer compression, especially since already a small number of values with larger bit widths has such large effects.
Finally, we examined a variety of larger bit widths with a fixed probability of p = 0.001 and different values of k. Again, most integers are characterized by a bit width of 2. Analogous to the observations above, the resulting degradation grows with the larger bit width as well as with increasing k.

Simple algorithms
Like BP, the family of Simple algorithms also belongs to the class of null suppression algorithms. The algorithm Simple9 [19] was developed for the compression of inverted indices, whereby the uncompressed input was considered a sequence of 32-bit integer values. Several variants exist; here, we focus on Simple-8b [20] for 64-bit integer values.

Scalar version
For Simple-8b, the input sequence is subdivided into blocks of variable length l, such that all values of a block (i) are encoded with the same bit width bw and (ii) the overall size of a compressed block does not exceed 60 bits, such that each block is encoded using one 64-bit word. Thus, we have 16 different possibilities c_scalar how to encode the values (60 values with bit width 1, 30 values with bit width 2, and so on). Those cases are encoded by the numbers 0-15, whose binary representation takes up the 4 remaining bits of each 64-bit word. Here, 14 different encoding cases are specified for blocks that do not exclusively contain zero values, and two cases are specified for blocks containing exclusively zero values. The different cases are shown in Table 1. The scalar data compression process is sketched in Fig. 3.
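As an illustration of the scalar case selection, the following C++ sketch uses the classic Simple-8b layout for the 14 non-zero cases (60 values of 1 bit, 30 values of 2 bits, and so on), which we assume corresponds to the non-zero cases of Table 1; the greedy selection loop and all names are our own, and the two all-zero cases are omitted for brevity.

#include <algorithm>
#include <cstdint>
#include <cstddef>

struct Simple8bCase { unsigned length; unsigned bit_width; };

// The 14 cases for blocks that are not all-zero: length * bit_width <= 60.
static const Simple8bCase kCases[14] = {
    {60, 1}, {30, 2}, {20, 3}, {15, 4}, {12, 5}, {10, 6}, {8, 7},
    {7, 8},  {6, 10}, {5, 12}, {4, 15}, {3, 20}, {2, 30}, {1, 60}};

static unsigned bit_width(uint64_t v) { return v ? 64 - __builtin_clzll(v) : 0; }

// Greedily pick the longest block (smallest case index) whose bit width
// suffices for the next values; returns -1 if a value needs more than 60 bits.
int select_case(const uint64_t* in, size_t remaining) {
    for (int c = 0; c < 14; ++c) {
        unsigned l = kCases[c].length;
        if (l > remaining) continue;            // not enough values left
        unsigned max_bw = 0;
        for (unsigned i = 0; i < l; ++i)
            max_bw = std::max(max_bw, bit_width(in[i]));
        if (max_bw <= kCases[c].bit_width) return c;
    }
    return -1;
}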

State-of-the-art SIMD approach
The state-of-the-art SIMD variant of Simple9 is called GroupSimple [8] and compresses 32-bit integer values; thus, a compressed block has a size of 32 bits. Here, the input sequence is subdivided in two dimensions: first, it is divided into blocks of k values and each block is mapped to its maximum value, as shown in Fig. 3. Hence, a new sequence called MaxArray is used to determine the encoding case for a matrix of l ⋅ k values. Because one control snip (the binary representation of the case c) corresponds to k compressed blocks containing l values each, it is stored separately with 4 bits, and all 32 bits of a block can be used for the data snip. This approach can easily be mapped to a GroupSimple-8b variant to compress 64-bit values. Here, we use the adapted encoding scheme pictured in Table 1. For example, for k = 4, if we have a sequence of 21 ⋅ 4 values where the first value has bit width 3 and all others bit width 2, we have to use case 3 (group size 21 and bit width 3) to encode the 4 blocks. In comparison, the scalar algorithm would encode the first 20 values with case 4 (group size 20 and bit width 3) and the next 30 values with bit width 2, which might end up with a lower compression factor.

Analyzing memory footprint
The compression factor for the scalar and the SIMD version of Simple-8b can be calculated by

cf(k) = [∑_c p(c, k) ⋅ r(k)] / [∑_c p(c, k) ⋅ 64 ⋅ l(c) ⋅ k],

where c is the case selector, p(c, k) the probability of the selected compression case c for given k, l(c) the block length (l_scalar or l_SIMD) for one block in dependence of the case c, and r(k) the compressed block size (r(1) = 64 bits for the scalar variant, r(k) = 64 ⋅ k + 4 bits for the SIMD variant). The probability of case 0 is p(0, k) = p(0)^(120⋅k). For all other cases it holds that the next l(c) values have at most bit width bw(c), but that it is not possible to apply a case c′ < c. This can be calculated by

p(c, k) = (∑_{b=0}^{bw(c)} p(b))^(l(c)⋅k) − ∑_{c′<c} p(c′, k).

We apply these formulas to the same data distributions as for BP: most values are characterized by a bit width of 2, but there is also a probability x for integer values with a larger bit width. While in the first case the larger bit width is 3, it is 60 in the second case. Figure 4a and b depict the compressed size ratio cf(k)/cf(1) for different k for the SIMD implementations for both cases and different probabilities. In Fig. 4a, we can see that for p(3) = 0.01 and k = 8, the SIMD variant needs nearly 20% more space for the compressed data than the scalar variant. But with increasing probability p(3) > 0.05, the SIMD variants are characterized by a better compression factor than the scalar variant (cf(k)/cf(1) < 1). Figure 4b depicts the compressed size ratio for the larger bit width of 60. Here we also see that the scalar variant has a better compression factor than the SIMD variants for small occurrence probabilities of values with bit width 60. For example, if values with bit width 60 occur with a probability of p(60) = 0.05 and we use k = 8, the scalar variant needs only 1/6 of the space used by the SIMD compression.
As shown in Fig. 4c, we examined a variety of larger bit widths with a fixed probability of p = 0.01 and different values of k. Again, most integers are characterized by a bit width of 2. As we can see, the memory footprint of the SIMD compression is worse than that of the scalar variant, and the factor grows with increasing larger bit widths and larger k.

Varint algorithms
Varint is another family of compression algorithms initially developed for 32-bit values [21]. They variably shorten the binary representation of single values to a number of bits that is a multiple of a fixed unit size (for example, 8 bits per unit u) and additionally store the number of units, such that the length of a compressed number is preserved. Here, we focus on the SIMD version of the algorithm Varint-GB for 64-bit integer values and call it Varint-GB64.

Scalar version
In the scalar version, it is possible to encode a 64-bit value with 1-8 Bytes by omitting leading zero bytes. The length descriptor contains a value from 0 to 7 and is stored in its binary representation with 4 bits. In the example in Fig. 5, the value is characterized by 5 leading zero Bytes and 3 Bytes that also contain 1-bits. Thus, the descriptor 3 is encoded as 0011, such that two descriptors fill exactly one Byte. The descriptors are stored at a separate place.
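A minimal sketch of this scalar encoding step may look as follows, assuming a little-endian memory layout and the descriptor convention of Fig. 5 (the descriptor holds the number of significant Bytes); the function name is illustrative.

#include <cstdint>
#include <cstring>

// Sketch: encode one 64-bit value by omitting leading zero bytes.
// Writes the 1-8 significant bytes to `out` and returns the 4-bit
// length descriptor, which is stored at a separate place (cf. Fig. 5).
unsigned varint_gb64_encode_value(uint64_t v, uint8_t* out) {
    unsigned bytes = 1;                          // at least one byte is emitted
    while (bytes < 8 && (v >> (8 * bytes)) != 0) ++bytes;
    std::memcpy(out, &v, bytes);                 // little-endian significant bytes
    return bytes;                                // descriptor value
}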

State-of-the-art SIMD approach
Like for Simple, the input sequence is divided into blocks of k consecutive values. The number of bytes used to encode each of the k values is determined by the maximum of the k values [8]. Thus, one length descriptor belongs to k consecutive values and is stored at a separate place. In the example in Fig. 5 with k = 4, the second value contains only 3 leading zero Bytes. Hence, all of the 4 values are encoded with 5 Bytes; the descriptor 5 is encoded as 0101.

Analyzing memory footprint
The uncompressed input size is always k ⋅ 64 bits. The expected compressed output size depends on the probabilities of the bit widths: the k values are encoded with u units if the largest value is characterized by a bit width bw with 8 ⋅ (u − 1) < bw ≤ 8 ⋅ u. Thus, the compression factor can be calculated by

cf(k) = [∑_u p(u, k) ⋅ (u ⋅ 8 ⋅ k + 4)] / (k ⋅ 64),

where u ⋅ 8 ⋅ k + 4 is the number of bits needed to compress a block of k values with u units each for given k, and p(u, k) is the probability that a block of k values is encoded with 8 ⋅ u bits per value. Following the same consideration as before,

p(u, k) = (∑_{bw=0}^{8⋅u} p(bw))^k − ∑_{u′<u} p(u′, k),

where p(bw) is the probability for a value of bit width bw.
In Fig. 6a, we compare the k-way compressed data size with the scalar compressed data size, where the probability of a value of bit width 10, which is encoded with 2 Bytes plus additional descriptor bits, varies from 0 to 1. All other values are of bit width 2 and are encoded with 1 Byte and up to four additional descriptor bits per value. For probabilities p(10) > 0.5, the scalar algorithm delivers a worse compression factor than all of the SIMD variants. But, for example, for k = 8 and an occurrence probability of values with bit width 10 of p(10) = 0.2, we have to accept an around 12% larger compressed data size. For p(60) in the same setting in Fig. 6b, we have to accept an around 140% larger compressed data size. As can be seen in Fig. 6c, for increasingly larger bit widths in this setting, the SIMD variants are characterized by a significantly larger compressed data size.

Further compression algorithms
So far, our analysis has focused on algorithms from the class of null suppression, but the analysis can be applied to the other techniques as well. Frame-of-reference (FOR) encodes each value of a block as the difference to the block's minimum and is often combined with BP. Thus, the bit width and the minimum have to be stored for each block. Similar to BP, the SIMD versions scale the block size by k. Here, the difference between the maximum value and the minimum value increases compared to the scalar variant, which, depending on the data distribution, often leads to a larger required bit width for each block and thus a worse compression factor. In the example in Fig. 7, the block size of 64 in the scalar case scales to k ⋅ 64. As a consequence, the first 64 values can no longer be encoded with bit width ⌈log₂(5 − 2)⌉ = 2; all first k ⋅ 64 values have to be encoded with bit width ⌈log₂(8 − 1)⌉ = 3.
Run-length encoding (RLE) is a compression technique that encodes each subsequence of identical consecutive values as a tuple of run length (number of values) and run value. Because of these inner data dependencies, the state-of-the-art scaling SIMD approach cannot be applied directly, but specialized SIMD variants exist [16]. In general, the performance of those SIMD algorithms increases only marginally compared to the scalar algorithm.
DELTA encodes each value except the first one as the difference to its predecessor. The scaling SIMD approach cannot be directly applied, but different SIMD versions exist. For instance, it is possible to encode k consecutive values as the differences to their k-th predecessors [6], as shown in Fig. 8. Obviously, the differences between a value and its k-th predecessor in the SIMD algorithm are larger than those to the direct predecessor in the scalar algorithm. Thus, the subsequent application of a null suppression algorithm, which is usually done to eliminate the leading zeros, will lead to a worse compressed data size.
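The following AVX-512 sketch illustrates this scaling DELTA scheme for k = 8 lanes of 64-bit values; tail handling is elided and all names are ours.

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Sketch: each value is replaced by the difference to its k-th (here 8th)
// predecessor, so the 8 subtractions per iteration are independent.
void delta8(const uint64_t* in, uint64_t* out, size_t n) {
    for (size_t i = 0; i < 8 && i < n; ++i) out[i] = in[i];  // keep first k values
    for (size_t i = 8; i + 8 <= n; i += 8) {
        __m512i cur  = _mm512_loadu_si512(in + i);
        __m512i pred = _mm512_loadu_si512(in + i - 8);       // k-th predecessors
        _mm512_storeu_si512(out + i, _mm512_sub_epi64(cur, pred));
    }
    // the remaining n % 8 values are omitted for brevity
}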

Summary
In this section, we analyzed the compression factors for different scalar and state-of-the-art scaling SIMD algorithm variants. As we have shown, the resulting compressed data sizes of the scaling SIMD variants are often larger than those of the scalar algorithm variants. This holds for the algorithms from the class of null suppression as well as for FOR or DELTA in combination with a subsequent null suppression. Moreover, the compressed size degradation effect increases with wider SIMD registers.

BOUNCE: block concurrent SIMD concept
As clearly described in the previous section, traditional scalar lightweight integer compression algorithms usually subdivide an input sequence into blocks of N consecutive values and compress each block separately one after another. Based on that scalar processing foundation, the state-of-the-art SIMD approach scales this block size to k ⋅ N with k as the number of integers that can be simultaneously processed with an SIMD register [8]. On the one hand, this scaling SIMD approach increases the performance of the compression routines, mainly due to the fact that only a contiguous (also called linear) data access pattern is required for the implementation [2,6-8,10]. On the other hand, the scaling SIMD approach also affects the compression result, especially its size. Depending on the data, the compression result can be larger or smaller compared to the scalar variant. This additional data dependency makes the use of such algorithms difficult, and therefore it would be desirable to have an alternative SIMD approach that does not have this property.
To develop an alternative, memory-efficient SIMD approach, we have to look in particular at the data access pattern. Contrary to scalar processing, SIMD registers have to be populated explicitly using specific SIMD instructions such as LOAD and GATHER. A linear access pattern, as used in the scaling SIMD approach, is conducted with the LOAD instruction and requires that the accessed data elements are organized as a contiguous sequence, which is given for integer compression. The linear loading of data elements into SIMD registers is usually considered the baseline. The GATHER instruction reflects the alternative for non-contiguous memory access, i.e., data elements that are distributed over the memory, and the common guideline is to avoid GATHER if possible due to its significant performance penalty compared to a LOAD.
However, this guideline does not always hold, as we have experimentally shown in [22]. The outcome of our comprehensive evaluation was that SIMD registers can be populated with data elements from non-consecutive memory locations using GATHER with (almost) the same performance as with data elements from consecutive memory locations using LOAD, in single-threaded as well as multi-threaded environments. To achieve that, the GATHER requires a proper access pattern called partition-based data access, as illustrated in Fig. 9. The essential properties (P1 and P2) of this access pattern are: (P1) The input data (a sequence of consecutive values) is logically partitioned into segments, with a segment size in Bytes calculated by

segment size = k ⋅ page size.

Here, k is the number of SIMD lanes of the underlying SIMD register according to the used data type, and the page size is determined by the underlying operating system (usually 4 KiB). That means, every segment consists of k pages.
(P2) The logical segments are processed successively, and the employed access pattern within each segment is a strided access pattern with a stride size in Bytes of

stride size = page size.

A strided access pattern is a special case of non-contiguous data access with a well-defined and predictable behavior. It realizes an equidistant data access, i.e., there is a constant (but configurable) distance, called the stride size, between accessed data elements in a contiguous sequence [23,24]. Based on that, each SIMD lane is responsible for exactly one disjunct page within each segment. Within the pages, the data elements are accessed linearly, so that all data elements will be processed. When a segment has been processed, the following segment is used. The advantage of this partition-based data access pattern is that it enables fine-grained, partition-based SIMD implementations with the same access performance as linear SIMD implementations.

Based on that foundation, we proposed the Block cOncUrreNt ConcEpt (BOUNCE) as an alternative SIMD approach for lightweight integer compression algorithms. In BOUNCE, the input sequence is implicitly partitioned into segments and each segment contains k pages, where k is the number of SIMD lanes of the underlying SIMD register according to the used data type. The pages are further implicitly partitioned into blocks according to the given scalar compression algorithm. Then, the scalar compression algorithm is applied to k disjunct blocks from the k pages within one segment. After the first k blocks are compressed in parallel within an SIMD register, the next k blocks are compressed. This is repeated until all blocks are compressed.
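The index arithmetic behind this partition-based access pattern can be summarized by the following C++ sketch (64-bit elements, k = 8 lanes, 4 KiB pages); the constants and names are ours.

#include <cstdint>

// Partition-based data access (P1, P2) for 64-bit elements and k = 8 lanes.
constexpr int64_t kPageElems    = 4096 / sizeof(uint64_t); // stride size (P2): 512 elements
constexpr int64_t kSegmentElems = 8 * kPageElems;          // segment size (P1): k pages

// Flat element index read by SIMD lane `lane` in iteration `j` of segment
// `seg`; these are exactly the indices fed to a GATHER instruction.
constexpr int64_t element_index(int64_t seg, int64_t lane, int64_t j) {
    return seg * kSegmentElems + lane * kPageElems + j;
}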
While the state-of-the-art scaling SIMD approach scales the block size by a factor of k, BOUNCE scales the number of concurrent scalar compression routines on disjunct blocks by a factor of k and keeps the block size the same, as illustrated in Fig. 1. That means, the number of SIMD lanes determines the number of blocks that will be compressed concurrently within an SIMD register. The locations of the blocks are determined by the number of SIMD lanes and the page size of the underlying operating system. In general, the advantages of BOUNCE are: (i) the block size can be chosen arbitrarily, so that e.g. the block size of the scalar compression can be maintained, and (ii) the control as well as data snips are calculated lane-wise. That means, with BOUNCE we are able to apply the scalar compression algorithm on each SIMD lane on different data blocks concurrently, which guarantees the same compression ratios as the scalar variant in all cases. Thus, our novel BOUNCE concept is more memory-efficient than the state-of-the-art SIMD concept.

BOUNCE application
While we introduced our general BOUNCE approach as an alternative SIMD concept for lightweight integer compression algorithms in the previous section, this section describes its concrete application to various algorithms. In particular, we describe the BOUNCE implementation of BP in more detail, since the implementations of the other algorithms are similar.

Application to BP
As introduced in Sect. 2.1, the scalar variant of BitPacking (BP) for an input sequence of 64-bit integer values partitions the input sequence into blocks of 64 consecutive values. For each block, the minimal number of bits required for the largest element is determined. Then, all 64 values of each block are compressed into a data snip with the respective number of bits for each value. The used bit width is stored in a single 64-bit integer as control snip.
The BP compression with BOUNCE is illustrated in Fig. 10. Again, we assume integer values of size 64 bits, and the assumed number of SIMD lanes k is 4, so that four different data blocks of 64 values are compressed concurrently. To simplify our description, the page size is set to the block size of 64 values (512 Bytes). That means, we are processing 256 integer values in total per segment of the input sequence and produce four control patterns as well as four data snips as compressed output per segment. The BOUNCE-BP compression is done in two phases, whereby each phase iterates over all 256 integer values. In the first phase, the bit width of the largest integer value within each block is determined, while the second phase uses the determined bit widths to shorten the values accordingly and to write out the compressed output. As shown in Fig. 10, we distinguish (i) a preprocessing step to load the data from the input into the vector registers, (ii) a computation step to shorten the values, and (iii) a postprocessing step to write the data into the output area. These steps are executed in each phase and each phase executes 64 iterations for the 64 values per block. That means, for the n-th iteration, we require the integer value at the n-th position of each considered data block in the SIMD register. Assuming that the input sequence contains correctly ordered data (horizontal data layout) [6-8], we use a GATHER instruction to load the corresponding values of the different data blocks into the SIMD register. In particular, and according to our BOUNCE concept, we realize a strided access with a stride size of 64 elements.
In the computation step, we apply the appropriate SIMD functions for each phase. In the first phase, we apply SIMD functions to compute the number of leading zeros of the largest value per lane (per block). Based on the number of leading zeros, we compute the minimal number of bits for the compression. These bit widths are used in the second phase to concatenate the shortened data values using an appropriate SIMD bit-shifting instruction, which can be applied for each lane individually. Because the data in the single lanes is concatenated with different bit widths, the lanes are filled at different loop passes. For example, a lane is full after 3 iterations for a bit width of 30 (assuming 64-bit integer values), but for a bit width of 2, we need 32 iterations to fill a lane. In any case, if one of the lanes is full, it has to be written to the output (postprocessing step). Here, we see two alternatives. The first alternative is to use a COMPRESSSTORE instruction to consecutively write out lanes as soon as they are full. In this case, the data snips of the different blocks are intertwined. The second alternative is to use a SCATTER instruction. Since the bit width for each block is determined first, the bit widths can also be used to calculate the position of each full lane in the output. In this case, we are able to organize the data snips of each block in a consecutive manner.
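For illustration, the following AVX-512 sketch implements the first phase, analogous to the description above but with k = 8 lanes and, as above, a page size of one block of 64 values: it gathers the n-th value of each of the 8 blocks, keeps a per-lane running maximum, and derives the bit width per lane from the leading-zero count. This is a sketch under these assumptions, not the actual implementation.

#include <immintrin.h>
#include <cstdint>

// Phase 1 of a BOUNCE-BP compression sketch (AVX-512, k = 8): determine
// the bit width of the largest value of each of the 8 blocks of a segment.
// Here, one page holds exactly one block of 64 values (stride = 64 elements).
void bounce_bp_bitwidths(const uint64_t* segment, uint64_t bw[8]) {
    const int64_t kStride = 64;                    // page/block size in elements
    const __m512i lane_off = _mm512_setr_epi64(
        0 * kStride, 1 * kStride, 2 * kStride, 3 * kStride,
        4 * kStride, 5 * kStride, 6 * kStride, 7 * kStride);
    __m512i max = _mm512_setzero_si512();
    for (int64_t n = 0; n < 64; ++n) {             // 64 iterations per phase
        // gather the n-th value of each block (strided access)
        __m512i idx = _mm512_add_epi64(lane_off, _mm512_set1_epi64(n));
        __m512i val = _mm512_i64gather_epi64(idx, segment, 8);
        max = _mm512_max_epu64(max, val);          // per-lane running maximum
    }
    __m512i lz = _mm512_lzcnt_epi64(max);          // AVX-512 CD leading-zero count
    _mm512_storeu_si512(bw, _mm512_sub_epi64(_mm512_set1_epi64(64), lz));
}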
The decompression routine of BOUNCE-BP can be built nearly straightforwardly the other way around. The preprocessing starts with loading the control patterns, which are required for the correct decompression, by applying a LOAD instruction. For the preprocessing of the compressed values, the instructions complementary to SCATTER and COMPRESSSTORE, namely GATHER and EXPAND, are applied to load the compressed data snips into the vector registers. During the decompression computation step, the compressed data snips are expanded to 64-bit integers by prepending leading zeros. For the postprocessing, an SIMD SCATTER instruction complementary to the GATHER in the preprocessing phase of the compression is applied. For simplicity, the GATHER/SCATTER alternative might be preferred, since the COMPRESSSTORE/EXPAND alternative requires a calculation-intensive mapping from consecutive compressed data to the different lanes. This can be avoided by decompressing the data from the last to the first value: at any point in time one or several lanes are discharged, and the next one or several 64-bit words holding compressed data are accessed via EXPAND and loaded into the discharged lanes.

Application to simple algorithms
As shown in Fig. 3, our BOUNCE concept can be applied to the family of Simple algorithms as well. Here, the page size determines which finite subsequence of the input has to be processed by which lane, and each subsequence might be subdivided into several data blocks that are compressed with 64 bits each. Two iterations per segment are used. In the first iteration, the specific cases of Table 1 that have to be used to encode the block size and the bit width are determined and stored as lists in one of k case buffers. Afterwards, it is known how much space is needed by each lane to store its compressed data and thus, where to start writing the k compressed pages in parallel. In the second iteration, the calculated cases corresponding to the values of a block in a page are used for a parallel encoding. In contrast to BP, the membership of the k gathered values to a block changes for each lane independently.

Application to varint algorithms
For the family of Varint algorithms, each value is a single block. For each of the k gathered values in BOUNCE, a case is calculated representing the number of Bytes used to encode the value, as shown in Fig. 5. Here, different data formats could be applied. For a register size of 512 bits with k = 8 64-bit values, it would be possible to use one register to store k ⋅ 16 = 128 descriptor values with k ⋅ 16 ⋅ 4 = 512 bits. Those bits could be stored separately, such that we have two memory regions for the output, or interleaved: 512 descriptor bits for 128 values followed by the 128 values of variable lengths.

Application to further algorithms
The BOUNCE version of a FOR-BP algorithm (a cascade of frame-of-reference and BP) does not differ in general from BP. Here, the scalar block size is preserved as well; minimum value and bit width per block are stored as k-way SIMD data. This is shown in Fig. 7. For RLE, we need, similar to the Simple algorithms, a run-length buffer, because there is no uniform block size and it varies per lane. Thus, after the run lengths per page of the whole segment are determined in a first iteration, we know the number of runs per lane and hence the start addresses for the compressed page data. Afterwards, the first value of each run is accessed via the accumulated run lengths per lane, and both run length and run value can be written to the correct output address. With a strided access, a BOUNCE-DELTA compression is possible as well. Here, the k first values per block are gathered and stored. All other values per page are gathered and encoded as the difference to their direct predecessors. This procedure is highlighted in Fig. 8.

Evaluation
To evaluate whether and when our BOUNCE concept for lightweight integer compression is suitable, we start with a description of our overall evaluation setup. Afterwards, we evaluate our underlying partition-based data access pattern to show that this access pattern delivers good performance characteristics. Then, we present in-depth evaluation results for BitPacking as a representative example. The other algorithms give similar results.

Evaluation setup
In general, all algorithms are implemented in C++ for the data type uint64_t. That means, we focus our overall evaluation on 64-bit data elements; the results for 32-bit data elements are comparable. Moreover, the SIMD variants of the algorithms are explicitly SIMDified using AVX-512 intrinsics with 512-bit wide SIMD registers. Thus, an SIMD register can hold 8 64-bit integer values. Our implemented algorithms are single-threaded and all implementations were compiled using g++ (version 9.3.0) with the optimization flags -O3 -fno-tree-vectorize -mavx512f -mavx512cd. We conducted our evaluation on three different Intel CPUs with three different architectures as shown in Table 2, all running Ubuntu Linux. Thus, we also specified the corresponding compiler flag -march for each CPU. Moreover, all experiments happened entirely in-memory with input sequences containing randomly generated values, and were repeated ten times; we report the averaged results.

Performance of the data access pattern
Generally, a drawback of our BOUNCE concept may be the utilization of expensive SIMD instructions like GATHER, SCATTER, or COMPRESSSTORE, so the performance could be expected to be worse compared to the state-of-the-art scaling SIMD approach. However, there are enough optimization knobs to overcome that, and we have taken a closer look at one knob as an example, which we already evaluated in more detail in [22]. To be self-contained, we include a specific evaluation result in this article.
For example, for our BOUNCE-BP compression, the integer values from different memory regions must be loaded twice into SIMD registers using GATHER instructions according to our partition-based data access pattern. In contrast to that, the state-of-the-art scaling SIMD approach also loads the values twice, but always consecutively by means of a LOAD instruction. However, our underlying partition-based data access pattern specifies two concrete properties P1 and P2, as described in Sect. 3, to satisfy performance aspects. To validate these properties, we implemented micro-benchmarks executing a sum-aggregation (mainly to focus on reading, with little computation and one write operation for the sum at the end) over a 4 GiB input array of randomly generated 64-bit unsigned integer values. We implemented this sum-aggregation as (i) a scalar variant, (ii) an SIMD variant using the LOAD instruction (linear access pattern), and (iii) an SIMD variant using the GATHER instruction according to our partition-based data access pattern with a variable stride distance (property P2). The underlying page size was always 4 KiB, which is state-of-the-art for today's systems. We executed these micro-benchmarks using AVX-512 on our different Intel platforms as depicted in Table 2. In each diagram of Fig. 11, the stride size in terms of the number of data elements (powers of 2) is shown on the x-axis and the throughput in GiB/s on the y-axis. As we can see, the curves are similar on the different platforms. The scalar variant always achieves the lowest throughput, while the SIMD variant with the linear access pattern always achieves the best throughput (expected behavior). The GATHER variant is in one stride size range much worse than the scalar variant, but for certain stride sizes it comes very close to the SIMD variant with the linear access pattern. Especially a stride size of 64 values (2^6) lies in the very unfavorable range, where the achievable throughput is much lower compared to the scalar variant. However, very good throughput values are achieved for stride sizes around 512 values (2^9). These well-performing stride distances match the page size of 4 KiB (2^9 ⋅ 64 bit), so that all SIMD lanes in BOUNCE load integer values from different pages that are cached after the first access and thus can be accessed optimally afterwards. We conclude that the underlying data access pattern of BOUNCE delivers very good performance results.

Fig. 11 Performance evaluation results for the different data access patterns on different Intel hardware platforms; AVX-512 with 64-bit data elements.
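For reference, the GATHER-based variant of this sum-aggregation micro-benchmark can be sketched as follows (AVX-512, 8 lanes; the stride is given in number of 64-bit elements, and all names are ours). With stride = 512, the eight gathered elements come from eight different 4 KiB pages, matching the well-performing configuration discussed above.

#include <immintrin.h>
#include <cstdint>

// Sketch of the GATHER-based sum-aggregation: 8 lanes sum strided 64-bit
// elements; with stride = 512 elements (one 4 KiB page), every lane stays
// within its own page of the current segment (property P2).
uint64_t strided_sum(const uint64_t* data, int64_t n, int64_t stride) {
    __m512i acc = _mm512_setzero_si512();
    const __m512i lane_off = _mm512_setr_epi64(
        0 * stride, 1 * stride, 2 * stride, 3 * stride,
        4 * stride, 5 * stride, 6 * stride, 7 * stride);
    for (int64_t seg = 0; seg + 8 * stride <= n; seg += 8 * stride)  // segments (P1)
        for (int64_t j = 0; j < stride; ++j) {                       // within pages
            __m512i idx = _mm512_add_epi64(lane_off, _mm512_set1_epi64(seg + j));
            acc = _mm512_add_epi64(acc, _mm512_i64gather_epi64(idx, data, 8));
        }
    return _mm512_reduce_add_epi64(acc);  // horizontal sum over the 8 lanes
}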

Evaluating BitPacking
To evaluate whether and when our BOUNCE concept for lightweight integer compression algorithms is suitable, we implemented BP as a representative example for 64-bit integer values in its scalar form (denoted as BP64), and compared it with the state-of-the-art SIMD approach (called SIMD-BP512) 1 and with our novel BOUNCE concept (denoted as BOUNCE-BP) using Intel's latest SIMD extension AVX-512. For all variants, we implemented the compression as well as decompression routines. For BOUNCE (k = 8), we implemented all possible variants as described in Sect. 4, but the evaluation in this article focuses only on the variant using GATHER and SCATTER instructions. The state-of-the-art SIMD implementation uses LOAD and STORE operations with a block scaling factor of k = 8. We ran this evaluation on an Intel Xeon Gold 6240R (Cascade Lake architecture; the newest of the three CPUs) with 768 GB main memory capacity (cf. Table 2). In all cases, we report the compression ratio 2 and the performance in million integers per second (mis) for the (de)compression, so that higher values are always better.

Synthetic data sets with fixed bit widths
In the first set of experiments, we created different synthetic data sets with randomly generated unsigned integer values. Each data set only contains values of a fixed bit width, and the number of unsigned integer values was set to 8′192′000 (62.5 MiB). Then, we applied all compression as well as decompression routines and the results are shown in Fig. 12. As we can see in Fig. 12a, these data sets are perfectly suited for the state-of-the-art SIMD approach, because it achieves higher compression ratios and a higher performance for compression as well as decompression. However, the BOUNCE-BP compression implementation (cf. Fig. 12b) with a stride distance of 512 (property P2) closely matches the performance of the state-of-the-art SIMD approach, while a stride distance of 64 yields a performance similar to the scalar variant. This clearly shows that the optimization of the stride distance is very important, but this is not in the focus of this paper. Decompression speeds, as illustrated in Fig. 12, show a similar behavior.

In the second set of experiments, we again generated synthetic data sets with randomly generated unsigned integer values, but in this case with a fixed bit width of 10 (as a representative bit width), and varied the number of unsigned integer values from 81′920 (0.62 MiB) to 8′192′000 (62.5 MiB) to investigate the impact of the data size. The achieved results are depicted in Fig. 13. As expected, the state-of-the-art SIMD approach for compression as well as decompression is better suited for a small number of integer values than our BOUNCE approach. The reason for this is that as long as the complete data (the data set itself including the compressed representation) fits into the cache, the linear access using the LOAD instruction of the state-of-the-art SIMD approach is faster than the GATHER instruction of BOUNCE. Thus, we conclude that BOUNCE-BP is better suited for data sets whose size exceeds the cache size.

Synthetic data sets with different bit widths
In the third set of experiments, we created different synthetic data sets similar to the setting in Sect. 2. That means, the integer values are mainly characterized by a bit width of 2, but we have a probability of p(bw) = 0.001 for integer values with a larger bit width bw. We varied this larger bit width bw from 3 to 64 for the data sets, and the results are shown in Fig. 14. Moreover, all data sets again contained 8′192′000 unsigned integer values. As we can see in Fig. 14a, the BOUNCE-BP compression achieves much higher compression ratios, resulting in a smaller compressed output. Since we have to write out less data, the performance for compression also improves, as depicted in Fig. 14b. In this case, BOUNCE-BP with a stride size of 512 values (4 KiB) clearly outperforms the state-of-the-art SIMD implementation, and the speedup increases with increasing bit widths. Moreover, BOUNCE-BP with a stride size of 64 values is slightly better than the scalar variant, which again shows the importance of the stride size optimization. The decompression performance is similar to the previous set of experiments, as the decompression is not optimized yet.

Impact of different hardware architectures
In the fourth set of experiments, we repeated the third set of experiments on the remaining CPUs to investigate the impact of the hardware. Figure 15 shows the achieved compression performances for the different hardware platforms. The Xeon Phi 7250 CPU has the second-generation MIC architecture from Intel called Knights Landing (KNL), where AVX-512 was available for the first time. As shown in Fig. 15a, BOUNCE-BP is faster than the scalar variant but slower than the state-of-the-art SIMD variant on that CPU. In contrast to that, BOUNCE-BP clearly outperforms the state-of-the-art SIMD variant on the Xeon Gold 6240R (cf. Figs. 14b and 15c). The Xeon Gold 6240R has a Cascade Lake architecture and is thus two generations newer than the KNL. The results indicate that the implementations of the GATHER and SCATTER instructions have been improved. This is also shown by the results on the Xeon Gold 6126 (Skylake architecture) in Fig. 15b. The Skylake architecture is newer than the KNL architecture, but older than the Cascade Lake architecture. Here, the achieved performances of BOUNCE-BP and the state-of-the-art SIMD variant are on par. This leads to the conclusion that our BOUNCE concept profits more from newer Intel architectures.

Evaluation on real data sets
In our last set of experiments, we used 1′332 columns from 31 randomly selected tables of the real-world publicBI benchmark. 3 We preprocessed the columns such that, among other steps, all columns with non-integer data types are dictionary encoded, where more frequent values are mapped to smaller numbers. From the data size perspective, 228 of the 1′332 columns are of a size that fits into the 35.75 MiB L3 cache of the used hardware platform (denoted as small columns), while the rest has a larger size (denoted as large columns). For each preprocessed column, we measured the compressed size, the compression time, and the decompression time for BOUNCE-BP (stride size of 512) and SIMD-BP512. In Fig. 16, we relate the compressed sizes as well as the compression respectively decompression times of the two approaches. The x-axis shows the relative compressed sizes of the columns for BOUNCE-BP512, where the value 1 is the relative size of SIMD-BP512; lower numbers are thus better in terms of compressed data sizes. The y-axis shows the relative (de)compression times for BOUNCE-BP512, where the value 1 is the (de)compression time of SIMD-BP512; higher numbers are thus better in terms of (de)compression times. As we can see, the resulting compressed sizes of almost all columns with BOUNCE-BP512 are lower and thus more memory-efficient. The (de)compression performances are often lower for the smaller columns. For the larger columns, the compression performances are sometimes higher than those of SIMD-BP512, but often we achieve 90-100% of SIMD-BP512. The decompression performance for smaller columns is even worse than for compression. For the decompression of larger columns, BOUNCE-BP512 achieves around 80-85% of SIMD-BP512. Thus, our results on real data confirm the results derived from synthetic data sets and clearly show the advantages of BOUNCE-BP.
Fig. 16 (a) Size vs. compression time; (b) size vs. decompression time.

Related work

As stated in [6], there is no single best lossless lightweight compression algorithm suitable for all data and hardware characteristics. To use data compression in main-memory database systems, we first have to provide hardware-tailored algorithms and second have to choose one that meets the optimization criteria, like memory footprint or performance, for specific workloads. A comprehensive overview of the field of lossless lightweight integer compression algorithms and SIMD implementations is given by the following papers [6-8,25]. Each of the algorithms is implemented manually or generated for a specific scenario concerning the data characteristics and the specific hardware; there is no consequent distinction between a compression concept and its implementation. Thus, we presented a meta-model to specify integer compression algorithms in a descriptive and abstract way with the ability to derive executable code from that description [26], with the goal of generating hardware-tailored implementations of compression algorithms. An integration of our presented generalized SIMD approach into the transformation to executable code is the focus of our ongoing research activities.
Moreover, the selection of the best-fitting integer compression variant is a research field with a very dynamic development [5,6]. With our alternative generalized SIMD approach BOUNCE, we extend the variety of variants, increasing the importance of this selection. From an SIMD execution point of view, our presented BOUNCE concept is in line with the idea of sharing vector registers for concurrently running queries as described in [27]. Nevertheless, the application as well as the specific challenges differ. However, both approaches show that an alternative use of SIMD execution can be profitably employed.

Conclusion
Integer compression plays an important role in reducing the memory footprint and speeding up query processing in column-stores. While a scalar compression algorithm usually compresses a block of N consecutive integers, the state-of-the-art SIMD implementation usually scales the block size to k ⋅ N with k as the number of elements that can be simultaneously processed in an SIMD register. However, this means that as the SIMD register size increases, the block of integer values for compression also grows, which can have a negative effect on the compression ratio. In this paper, we analyzed this effect and showed that the compressed output can be many times larger than the result of a scalar implementation. To overcome that, we presented an alternative SIMD concept called BOUNCE, which concurrently compresses k different blocks of size N within SIMD registers with k lanes. Due to the promising results for the heavily used integer compression representative BitPacking, we want to intensify our work in this area. In particular, we will further optimize our concept while investigating different integer compression algorithms.
In general, BOUNCE can lead to a more responsible usage of main memory resources, which is necessary for cloud environments.
Author contributions JH, DH, and WL wrote the main manuscript and prepared the figures. All authors reviewed the manuscript.
Funding Open Access funding enabled and organized by Projekt DEAL.

Competing interests
The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.