Introduction

Due to the ubiquity of mobile phones, Internet of Things (IoT) devices, and wireless sensor networks, the volume of data collected by these devices has grown exponentially in recent years. The big data era poses many challenges for data processing and storage, and a core problem is how to effectively compress many kinds of data, e.g., text, images, video, genomic data, and 3D VR data. In this paper, we consider only lossless data compression, where the reconstructed message is identical to the original message. In Shannon's theory, the minimum rate achievable by lossless compression is the entropy of the source. Many lossless compression algorithms [1] have been invented, e.g., Huffman coding and LZ77/78, while Arithmetic Coding (AC) [2] is the state-of-the-art entropy coding algorithm in terms of compression rate.

An AC codec consists of two parts: a probability estimator, which estimates each symbol's probability, and an AC model, in which the encoder recursively maps the input symbol sequence into a subinterval of the unit interval [0,1) based on the cumulative probabilities of the symbols, and the decoder uses the code (a real number in [0,1)) to output the reconstructed symbol sequence based on the same cumulative probabilities. An accurate estimator is the key to an effective AC codec.

With the great success of deep learning in recent years, Recurrent Neural Networks (RNNs), which are well suited to modeling temporal dynamic behavior, have also been considered in the design of AC-related algorithms. [3] combined a character-level predictive RNN with statistical coding techniques for text compression. [4] used an RNN as a context mixer to mix the opinions of multiple compressors and achieved better compression performance. CMIX [5] improved on [4] by using a Long Short-Term Memory (LSTM) context mixer to mix a total of 2,122 independent models. Recently, [6] presented DeepZip, which combines RNN predictors with an arithmetic coder to losslessly compress a variety of text and genomic datasets. DeepZip outperforms Gzip on both real and synthetic datasets.

All of the above algorithms are designed to run on desktop computers, and therefore their power and memory consumption have not been considered. In recent decades, the rapid development of the IoT has brought new challenges to both the computer science and communication communities. When billions of IoT devices are connected and communicating, the data have to be processed in a distributed fashion, known as the edge computing paradigm [7]. Edge computing processes and analyzes data at the edge of the Internet, in close proximity to mobile devices, sensors, and end users. This technology provides scalability and privacy-policy enforcement for the IoT, highly responsive cloud services for mobile computing, and the ability to smooth over transient cloud outages. Current edge computing devices, e.g., the NVIDIA Jetson Nano, Google Edge TPU, Intel Neural Compute Stick 2, and Raspberry Pi, provide low-latency, low-power platforms with limited memory for deep neural network models.

The main problem addressed in this paper is how to perform lossless data compression effectively on edge computing platforms with limited power and computing resources. Because AC achieves a better compression rate than other algorithms, we select AC as the codec. In 2020, [8] proposed EdgeDRNN, a delta RNN accelerator for edge inference. Inspired by this idea, we present DRAC: an AC model based on a delta RNN running on edge computing devices. Our main contributions can be summarized as follows:

  • We describe DRAC: an AC codec based on a delta RNN for edge computing devices. To the best of our knowledge, this is the first flexible, low-cost, low-latency, and power-efficient compressor of its kind.

  • We implement DRAC on a Xilinx Zynq 7000 SoC board and use this platform to demonstrate that DRAC is suitable for edge devices.

  • DRAC outperforms DeepZip in our experiments. We use text and genomic datasets to evaluate DRAC, DeepZip, and Gzip. The experimental results show that DRAC achieves a compression rate comparable to DeepZip and better than Gzip, while outperforming both in power consumption and running time.

The rest of this paper is organized as follows: "Model description" describes the DRAC model. The delta RNN is introduced in "Delta recurrent neural networks", and the experimental results are presented in "Experimental results". "Conclusion" concludes the paper.

Model description

Our DRAC consists of two blocks: the AC algorithm and a probability predictor. Let \(Y = \{y_0, y_1, ..., y_{N-1}\}\) be the data sequence. The probability predictor is a Delta RNN model that is trained on Y for multiple epochs. After training is completed, the trained model weights are known to both the encoder and the decoder. Given the previous J symbols, where J is set to 1 in our model, the probability \(P(y_i|y_{i-1},..., y_{i-J})\) is estimated by the Delta RNN and fed into the AC encoder, and this process repeats until the final code is output; a sketch of this encoding loop is given below. For the initial J symbols, a uniform distribution is used as the prior. This process is illustrated in Fig. 1a. The decoding process is symmetric to the encoding process: the decoder takes the code as input and repeatedly outputs reconstructed symbols using the probabilities from the predictor, as illustrated in Fig. 1b. We describe the AC algorithm block in "AC algorithm" and the probability predictor block in "Probability predictor".
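As a concrete illustration, the following Python sketch shows the encoding loop just described. The interfaces `predictor.predict` and the `encoder` object are hypothetical placeholders for the probability predictor and the AC encoder blocks, not the actual implementation.

```python
import numpy as np

def drac_encode(sequence, predictor, encoder, alphabet_size, J=1):
    """Drive the AC encoder with per-symbol probabilities from the predictor."""
    uniform = np.full(alphabet_size, 1.0 / alphabet_size)
    for i, symbol in enumerate(sequence):
        if i < J:
            probs = uniform  # uniform prior for the first J symbols
        else:
            # P(y_i | y_{i-1}, ..., y_{i-J}) from the trained Delta RNN
            probs = predictor.predict(sequence[i - J:i])
        encoder.encode_symbol(symbol, probs)
    return encoder.finish()  # the final code, a number in [0, 1)
```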

Fig. 1 Model of DRAC

AC algorithm

We describe the encoding process of AC as follows. Let \({\mathcal {X}} =\{x_0,x_1,...,x_{M-1}\}\) be the alphabet of a discrete source and \(P=\{p(x_0),p(x_1),...,p(x_{M-1})\}\) be the probabilities of the symbols, which are predicted by the probability predictor described in "Probability predictor". The cumulative probabilities \(C=\{c(x_0),c(x_1),...,c(x_{M})\}\) are defined as

$$\begin{aligned} c(x_i)=\sum _{k=0}^{i-1} p(x_k) , \end{aligned}$$
(1)

so that \(c(x_0)\equiv 0\) and \(c(x_M)\equiv 1\). The AC encoder recursively maps each symbol of the input sequence into a subinterval of the current interval. Let \(Y = \{y_0, y_1, ..., y_{N-1}\}\) be the input sequence and \(I_0=[L_0, H_0)=[0,1)\) be the initial interval. The new subinterval \(I_j=[L_j, H_j)\) is calculated by

$$\begin{aligned}&H_j=L_{j-1}+(H_{j-1}-L_{j-1})\times c(y_j+1) \end{aligned}$$
(2)
$$\begin{aligned}&L_j=L_{j-1}+(H_{j-1}-L_{j-1})\times c(y_j), \end{aligned}$$
(3)

where \(c(y_j+1)\) denotes the cumulative probability of the symbol that follows \(y_j\) in the alphabet, i.e., \(c(x_{i+1})\) when \(y_j = x_i\). Once the encoding process ends, any real number D in the final interval can be chosen as the code.
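A minimal Python sketch of Eqs. (1)-(3) is given below; `probs_per_step` holds one probability vector per input symbol, as produced by the predictor. It uses exact rational arithmetic via `fractions.Fraction` for clarity, whereas a practical AC implementation, such as the library used in "Implementation", renormalizes finite-precision integer intervals instead.

```python
from fractions import Fraction

def cumulative(probs):
    """Eq. (1): c(x_0) = 0, ..., c(x_M) = 1."""
    c = [Fraction(0)]
    for p in probs:
        c.append(c[-1] + Fraction(p))
    return c

def ac_encode(symbols, probs_per_step):
    """Eqs. (2)-(3): shrink [L, H) once per input symbol."""
    low, high = Fraction(0), Fraction(1)
    for y, probs in zip(symbols, probs_per_step):
        c = cumulative(probs)
        width = high - low
        high = low + width * c[y + 1]  # Eq. (2)
        low = low + width * c[y]       # Eq. (3)
    return low, high  # any real number D in [low, high) can serve as the code
```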

For the decoding process, let \({\widetilde{Y}} = \{{\widetilde{y}}_0, {\widetilde{y}}_1, ..., {\widetilde{y}}_{N-1}\}\) be the output sequence of the AC decoder and let \(D_0=D\). The decoder recursively performs the following steps:

  1. Determine the symbol \(x_i\) from C based on the current code \(D_j\) and output \({\widetilde{y}}_j=x_i\), that is,

    $$\begin{aligned} {\widetilde{y}}_j=\{x_i:c(x_i)\le D_j< c(x_i+1)\}, \quad j\in \{0,1,...,N-1\}. \end{aligned}$$
    (4)

  2. Update \(D_j\) by

    $$\begin{aligned} D_{j+1}=\frac{D_j-c(x_i)}{p(x_i)}. \end{aligned}$$
    (5)

Once the decoding process ends successfully, the complete output sequence has been reconstructed. In DRAC, the probabilities used at each step are produced by the predictor from the previously decoded symbols, exactly as in the encoder; a decoder sketch is given below.
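The decoder sketch below mirrors Eqs. (4)-(5) and reuses the `cumulative` helper from the encoder sketch above. In DRAC, the `probs_per_step` values would be regenerated on the fly by the predictor from the already decoded symbols.

```python
from fractions import Fraction

def ac_decode(code, probs_per_step):
    """Recover the symbol sequence from the code D using Eqs. (4)-(5)."""
    D = Fraction(code)
    decoded = []
    for probs in probs_per_step:
        c = cumulative(probs)  # same helper as in the encoder sketch
        # Eq. (4): the index i with c(x_i) <= D < c(x_{i+1})
        i = max(k for k in range(len(probs)) if c[k] <= D)
        decoded.append(i)
        D = (D - c[i]) / Fraction(probs[i])  # Eq. (5)
    return decoded
```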

Probability predictor

Our probability predictor is a delta RNN network, whose architecture is shown in Fig. 2. For a symbol \(y_i\), the past J symbols \(y_{i-J}, y_{i-J+1},..., y_{i-1}\) are input to a bi-directional Delta Gated Recurrent Unit (Delta GRU) cell, which is introduced in "Delta recurrent neural networks". A Fully Connected (FC) layer followed by a Softmax layer is then applied to the hidden states obtained from the Delta GRU cells, yielding the probability \(p(y_i)\). The Softmax layer is defined as

$$\begin{aligned} softmax(z)_i=p_i=\frac{e^{z_i}}{\sum _{j=1}^{|{\mathcal {X}}|}e^{z_j}}, \end{aligned}$$
(6)

where \({\mathcal {X}}\) is the alphabet of the symbols, and \(|\cdot |\) is the cardinality. For input X, the FC layer is defined as

$$\begin{aligned} F = \sigma (XW^T+B), \end{aligned}$$
(7)

where W is the weight matrix, B is the bias, and \(\sigma (x)\) is the activation function.
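As an illustration of Eqs. (6)-(7), a PyTorch sketch of the predictor head is given below. We assume the FC layer maps the concatenated hidden states of the bi-directional Delta GRU to logits over the alphabet, with the Softmax applied last and \(\sigma \) taken as the identity; the layer sizes are illustrative only.

```python
import torch
import torch.nn as nn

class PredictorHead(nn.Module):
    """FC layer (Eq. (7)) followed by Softmax (Eq. (6)) over the alphabet."""

    def __init__(self, hidden_size, alphabet_size):
        super().__init__()
        # 2 * hidden_size: forward and backward hidden states are concatenated
        self.fc = nn.Linear(2 * hidden_size, alphabet_size)

    def forward(self, h):
        logits = self.fc(h)                   # XW^T + B, sigma = identity here
        return torch.softmax(logits, dim=-1)  # p(y_i) fed to the AC codec
```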

Fig. 2 Architecture of probability predictor

Fig. 3 Architecture of GRU cell

Delta recurrent neural networks

The Gated Recurrent Unit (GRU) is a class of gated RNN and is very popular in the natural language processing area. A GRU cell is illustrated in Fig. 3a. Let \({\mathbb {R}}\) be the set of real numbers and \(X = \{x_0, x_1, ..., x_{N-1}\}\) be the input sequence with length N. For a GRU cell with H neurons, one-dimensional input \(x_i\), and output \(h_i\), the update equations are defined as

$$\begin{aligned}&\varvec{r}_i = \sigma (\varvec{W}_{xr}x_i+\varvec{W}_{hr}h_{i-1}+\varvec{b}_r) \nonumber \\&\varvec{u}_i = \sigma (\varvec{W}_{xu}x_i+\varvec{W}_{hu}h_{i-1}+\varvec{b}_u) \nonumber \\&\varvec{c}_i = \tanh (\varvec{W}_{xc}x_i+\varvec{r}_i\odot (\varvec{W}_{hc}h_{i-1}+\varvec{b}_c)) \nonumber \\&\varvec{h}_i = (1-\varvec{u}_i)\odot \varvec{c}_{i}+\varvec{u}_{i}\odot h_{i-1}, \end{aligned}$$
(8)

where \(\varvec{r},\varvec{u},\varvec{c}\in {\mathbb {R}}^H\) are the reset gate, update gate, and cell state, respectively; \(\varvec{W}_x\in {\mathbb {R}}^{H}\) and \(\varvec{W}_h\in {\mathbb {R}}^{H\times H}\) are the weight matrices, and \(\varvec{b}\in {\mathbb {R}}^{H}\) is the bias vector. \(\sigma (x)\) is the activation function (the logistic sigmoid for the gates), and \(\odot \) denotes element-wise multiplication of vectors.
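The following NumPy sketch transcribes Eq. (8) for a single timestep, assuming \(\sigma \) is the logistic sigmoid for the gates; the shapes follow the text (\(\varvec{W}_x\in {\mathbb {R}}^{H}\), \(\varvec{W}_h\in {\mathbb {R}}^{H\times H}\), \(\varvec{b}\in {\mathbb {R}}^{H}\), scalar input \(x_i\)).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_i, h_prev, W_xr, W_hr, b_r, W_xu, W_hu, b_u, W_xc, W_hc, b_c):
    """One GRU update (Eq. (8)) for a scalar input x_i and H-dim state h_prev."""
    r = sigmoid(W_xr * x_i + W_hr @ h_prev + b_r)        # reset gate
    u = sigmoid(W_xu * x_i + W_hu @ h_prev + b_u)        # update gate
    c = np.tanh(W_xc * x_i + r * (W_hc @ h_prev + b_c))  # candidate cell state
    return (1.0 - u) * c + u * h_prev                    # new hidden state h_i
```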

To reduce the computational complexity of the GRU, [9] proposed delta recurrent networks. Later, [8] applied the Delta GRU to edge devices. Inspired by their ideas, we also apply the Delta GRU in our model. We define the following variables:

$$\begin{aligned} {\hat{x}}_i&= {\left\{ \begin{array}{ll} x_i, &{}|x_i-{\hat{x}}_{i-1}| \ge \eta _x \\ {\hat{x}}_{i-1} , &{}|x_i-{\hat{x}}_{i-1}|< \eta _x \end{array}\right. } \nonumber \\ {\hat{h}}_i&= {\left\{ \begin{array}{ll} h_i, &{}|h_i-{\hat{h}}_{i-1}| \ge \eta _h \\ {\hat{h}}_{i-1} , &{}|h_i-{\hat{h}}_{i-1}|< \eta _h \end{array}\right. } \nonumber \\ \delta x_i&= {\left\{ \begin{array}{ll} x_i-{\hat{x}}_{i-1}, &{}|x_i-{\hat{x}}_{i-1}| \ge \eta _x \\ 0 , &{}|x_i-{\hat{x}}_{i-1}|< \eta _x \end{array}\right. } \nonumber \\ \delta h_i&= {\left\{ \begin{array}{ll} h_i-{\hat{h}}_{i-1}, &{}|h_i-{\hat{h}}_{i-1}| \ge \eta _h \\ 0 , &{}|h_i-{\hat{h}}_{i-1}| < \eta _h \end{array}\right. }, \end{aligned}$$
(9)

where \({\hat{x}}_i\) is the input state memory at timestep i, \({\hat{h}}_i\) the hidden state memory at timestep i, \(\delta x_i\) the delta input state, and \(\delta h_i\) the delta hidden state; \(\eta _x\) and \(\eta _h\) are the input and hidden-state delta thresholds, respectively. At the initial timestep, i.e., \(i=1\), \({\hat{x}}_0\), \(h_0\), and \({\hat{h}}_0\) are all set to zero. The update equations of the Delta GRU are defined as

$$\begin{aligned}&\varvec{M}_{r,i} = \varvec{W}_{xr}\delta x_i+\varvec{W}_{hr}\delta h_{i-1}+\varvec{M}_{r,i-1} \nonumber \\&\varvec{M}_{u,i} = \varvec{W}_{xu}\delta x_i+\varvec{W}_{hu}\delta h_{i-1}+\varvec{M}_{u,i-1} \nonumber \\&\varvec{M}_{xc,i} = \varvec{W}_{xc}\delta x_i+\varvec{M}_{xc,i-1} \nonumber \\&\varvec{M}_{hc,i} = \varvec{W}_{hc}\delta h_{i-1}+\varvec{M}_{hc,i-1} \nonumber \\&\varvec{r}_i = \sigma (\varvec{M}_{r,i}) \nonumber \\&\varvec{u}_i = \sigma (\varvec{M}_{u,i}) \nonumber \\&\varvec{c}_i = \tanh (\varvec{M}_{xc,i}+\varvec{r}_i\odot \varvec{M}_{hc,i}) \nonumber \\&\varvec{h}_i = (1-\varvec{u}_{i})\odot \varvec{c}_i+\varvec{u}_i\odot \varvec{h}_{i-1} , \end{aligned}$$
(10)

where \(\varvec{M}_{r,i=0}=\varvec{b}_r\), \(\varvec{M}_{u,i=0}=\varvec{b}_u\), \(\varvec{M}_{xc,i=0}=\varvec{b}_c\), and \(\varvec{M}_{hc,i=0}=\varvec{0}\) are the delta memory vectors, with \(\varvec{M} \in {\mathbb {R}}^H\).

Compared with the standard GRU, the Delta GRU prunes weights and removes unimportant neuron connections, which results in sparse weight matrices. Sparse matrices can be stored in a sparse matrix format, and sparse matrix-vector multiplication can be accelerated by executing multiply-and-accumulate (MAC) operations only on nonzero weights. In addition, by skipping the zero elements of the delta vectors, whole columns of matrix-vector MAC operations can be skipped. After the model is properly trained, experiments show that delta networks reduce the number of operations by 5 to 100 times [8, 9].
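The NumPy sketch below illustrates the mechanism of Eqs. (9)-(10): the thresholded delta states are computed first, and the matrix-vector products then touch only the columns whose delta entry is nonzero. This is a didactic float version, not the fixed-point Q8.8 FPGA kernel.

```python
import numpy as np

def delta_state(v, v_hat_prev, eta):
    """Eq. (9): keep a change only if it is at least the threshold eta."""
    change = v - v_hat_prev
    keep = np.abs(change) >= eta
    delta = np.where(keep, change, 0.0)    # zero deltas will be skipped
    v_hat = np.where(keep, v, v_hat_prev)  # updated state memory
    return delta, v_hat

def sparse_matvec(W, delta, acc):
    """Accumulate W @ delta into acc, skipping columns with zero delta."""
    for j in np.flatnonzero(delta):
        acc += W[:, j] * delta[j]
    return acc
```

For example, the reset-gate memory of Eq. (10) can be updated as `M_r = sparse_matvec(W_hr, delta_h, M_r + W_xr * delta_x)`, after which `r = sigmoid(M_r)` as in the GRU sketch above.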

Experimental results

We report the experimental results of the DRAC model in this section.

Datasets

Four datasets are used in our experiments, as listed in Table 1. webster [10] and enwiki9 [11] are text datasets downloaded from web resources, in which the symbols are ASCII characters and the alphabet sizes are 96 and 206, respectively. h.chr1 [12] and c.e.genome [13] are genomic datasets, in which the symbols are \(\{A,C,G,T\}\) and the alphabet size is 4. All four datasets are real data. In [6], the model is also evaluated on synthetic datasets, i.e., artificial data generated from synthetic sources of known entropy. Due to the limited length of this paper, we leave experiments on synthetic datasets to future research.

Training process

To train the Delta RNN-based model, we take the categorical cross entropy as the loss function, which is defined as

$$\begin{aligned} L = \frac{1}{N}\sum _{n=1}^{N}\sum _{m=1}^{M}p_m\log _2\frac{1}{{\hat{p}}_m}, \end{aligned}$$
(11)

where N is the sequence length, M is the alphabet size, \(p_m\) is the one-hot encoded ground truth, and \({\hat{p}}_m\) is the predicted probability. The Adam optimizer [14] is used to minimize the categorical cross entropy L. The sequence is divided into \(\lfloor \frac{N}{K+1}\rfloor \) segments of length \(K+1\) each; in each segment, the first K symbols are the input and the last symbol is the target output. We set K to 64 and the batch size to 1024. The maximum number of epochs is 10 and the learning rate is \(3e-4\). Training is terminated early when there is no further improvement within the 10 epochs. In every epoch, if L decreases below the previous minimum, the saved model is updated. This training process is similar to that introduced in [6, 8].
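A minimal PyTorch sketch of this training procedure is given below. The model interface (`model(x)` returning the predicted probabilities over the alphabet, as produced by the predictor head above), the tensor names, and the checkpointing details are illustrative assumptions; PyTorch's loss uses the natural logarithm, which differs from Eq. (11) only by a constant factor.

```python
import torch
import torch.nn as nn

K, BATCH_SIZE, MAX_EPOCHS, LR = 64, 1024, 10, 3e-4

def train(model, inputs, targets):
    """inputs: [num_segments, K] symbol ids; targets: [num_segments] next-symbol ids."""
    criterion = nn.NLLLoss()  # negative log likelihood on log-probabilities, cf. Eq. (11)
    optimizer = torch.optim.Adam(model.parameters(), lr=LR)
    best_loss = float("inf")
    for epoch in range(MAX_EPOCHS):
        epoch_loss = 0.0
        for start in range(0, inputs.size(0), BATCH_SIZE):
            x = inputs[start:start + BATCH_SIZE]
            y = targets[start:start + BATCH_SIZE]
            optimizer.zero_grad()
            probs = model(x)                               # predicted probabilities p_hat
            loss = criterion(torch.log(probs + 1e-12), y)  # categorical cross entropy
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item() * x.size(0)
        epoch_loss /= inputs.size(0)
        if epoch_loss < best_loss:  # keep the best model whenever L improves
            best_loss = epoch_loss
            torch.save(model.state_dict(), "drac_best.pt")
```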

Implementation

The arithmetic coding block in DRAC is borrowed from an open-source project, the Arithmetic Coding Library [15], which is implemented in Python. We made only minor modifications to integrate it with the predictor block. The Delta RNN block is implemented with the PyTorch 1.2.0 framework. The training process was performed on one NVIDIA TITAN V GPU with CUDA 10.0 and cuDNN 7.6.

Our edge device is a Xilinx Zynq-7000 SoC on a customized baseboard. It has a dual-core ARM Cortex-A9 CPU, a Kintex-7 XC7Z100 FPGA, and 1 GB of DDR3 SDRAM. Its operating system is PetaLinux, an embedded Linux distribution from Xilinx. Once DRAC is trained, we use a Python script to convert the Python network modules into C++ header files containing the network parameters of the Delta RNN. The application is compiled with the standard cross compiler, and the resulting image is loaded into the flash memory of the board.

Table 1 Datasets

Delta threshold

Compared with a standard RNN, the major change introduced by the Delta RNN is the delta thresholds \(\eta _x\) and \(\eta _h\). In this section, we experimentally investigate the effect of \(\eta \) (with \(\eta = \eta _x = \eta _h\)). The Delta RNN was trained on the datasets with delta thresholds \(\eta \) ranging from 0x00 to 0x80 (\(\eta \) is in 16-bit Q8.8 format; 0x80 in Q8.8 corresponds to 0.5 in decimal). The predicted probability error is calculated to evaluate the accuracy. The predicted probability accuracy as a function of the delta threshold is shown in Fig. 4. We find that for small \(\eta \), the accuracy remains at almost \(100\%\), while for \(\eta \) larger than 0x80 the accuracy decreases sharply.

We take a sample from webster with a length of 10M and use it to measure the calculation time of the prediction. Figure 4 shows that as \(\eta \) increases, the calculation time gradually decreases. This is because a large delta threshold allows many MAC operations to be skipped, and the reduced number of operations results in reduced power consumption and calculation time. As a compromise between accuracy and calculation time, we choose \(\eta =\) 0x80 as the operating point in the following experiments.

Fig. 4 Predicted probability accuracy as a function of the delta threshold

Performance comparison

To investigate the performance of our model, we compare DRAC with DeepZip in terms of compression rate, running time, and power consumption.

The four datasets in Table 1 were used in this experiment. We choose two versions of DeepZip [6]: DeepZip-b, whose neural network is a bi-directional GRU variant, and DeepZip-l, whose neural network consists of multiple LSTM cells. We use this setup because DeepZip-b and DeepZip-l are state-of-the-art lossless compressors. All DeepZip models are tested on a desktop server with an Intel i7-6800K CPU and an NVIDIA TITAN V GPU. The CPU power is measured with the Stress Terminal UI monitoring tool [16] and the GPU power with the nvidia-smi utility. DRAC runs on the Zynq-7000 board, whose total power is measured with a wall-plug power meter. The running times are also measured. All experimental results are reported in Table 2, in which the best results are boldfaced.

Table 2 Experimental results

The experimental results show that DRAC achieves nearly the same compressed size as DeepZip, while greatly outperforming it in running time and power consumption: DRAC achieves roughly a 5X speedup and a 20X reduction in power consumption compared with DeepZip. This improvement is due to the introduction of the Delta RNN. The experiments show that DRAC is well suited to the edge computing framework.

Conclusion

We present DRAC, an AC codec based on a Delta RNN for lossless compression, and implement it on a Zynq-7000 SoC board, which makes it suitable for the edge computing environment. We compare DRAC with DeepZip on four datasets; DRAC matches DeepZip's compression rate while outperforming it in running time and power consumption.

This work has two main limitations: the delta threshold \(\eta \) is selected heuristically, and the set of evaluation datasets is limited. In future research, we hope to develop a guided search algorithm that converges quickly to the optimal \(\eta \) by dynamically trading off accuracy against running time. We will also evaluate DRAC on more real and synthetic datasets.