1 Introduction

Unlike image formats such as JPEG or TIFF, a GIF (Graphics Interchange Format) image is composed of a color palette and a matrix of index values. Because of its prevalence in social network applications, the GIF format is well suited to covert communication by hiding secret data in images. GIF images fall into two categories: static GIF images and dynamic GIF images. Dynamic GIFs are more popular in online social networks (OSNs). The animated GIF is a type of dynamic GIF that is often used to enrich social expression and emotional performance. The most popular form is the animated emoji, which is now widely used in OSNs such as WeChat, Twitter, and Weibo.

Steganography embeds secret messages into digital covers without introducing perceptible distortion [1]. Early steganography methods include LSB (Least Significant Bit) replacement and F5 [2]. LSB replacement is the simplest [3]: it stores information in the least significant bit of each pixel, so the human eye cannot perceive the changes. F5 uses matrix embedding to hide secret messages in JPEG images. However, these methods are fragile against modern steganalysis. More recently, the Syndrome Trellis Coding (STC) framework has become popular for steganography [4]; it tries to minimize an additive distortion between the cover and the stego using a predefined distortion function. Generally, the distortion function assigns different distortion costs to different elements of the cover. For spatial images, HILL (HIgh-pass, Low-pass, and Low-pass) [5], WOW (Wavelet Obtained Weights) [6], and S-UNIWARD (Spatial UNIversal Wavelet Relative Distortion) [7, 8] are widely used. Meanwhile, J-UNIWARD (JPEG UNIversal Wavelet Relative Distortion) [7, 8], UED (Uniform Embedding Distortion) [9], UERD (Uniform Embedding Revisited Distortion) [10], and HDS (Hybrid Distortion Steganography) [11] are widely used for JPEG images, where the transform-domain coefficients are modified according to their distortion costs. Besides, adaptive methods have been proposed to guide the modification direction of the coefficients, which often improves the security of the modified image [12,13,14,15].

As the adversary, steganalysis aims to break steganography by analyzing the features of an image to determine whether it contains secret messages [16]. The rapid development of steganalysis poses a great challenge to steganography. Generally, a steganalyzer is a classifier that learns the differences between the features of covers and stegos [17]. The security of a steganography method can be evaluated by the accuracy of such a classifier. Many feature extraction methods exist, e.g., SPAM (Subtractive Pixel Adjacency Model) [18], SRM (Spatial Rich Model) [19], DCTR (Discrete Cosine Transform Residual) [20], and GFR (Gabor Filters Residual) [21].

As GIF images become more widespread, researchers are paying more attention to GIF steganography. The first steganography algorithm for indexed images such as static GIF was proposed in [22]; the scheme searches for the closest color in the palette to reduce the distortion caused by data hiding. In [23], adaptive strategies are proposed to determine which pixels should be modified to embed data. To the best of our knowledge, the first method for embedding data into dynamic GIF images was proposed in [24]. Subsequently, more steganography approaches for animated GIF were proposed [25,26,27,28]. In [28], the researchers propose a framework that embeds data into animated GIF using the differences between adjacent pixels in the same frame. In [29], a method is proposed that hides data in animated emoji GIFs using the STC framework, with improved distortion functions for better security.

In this paper, we propose a steganography scheme for animated emoji using self-reference, in which we integrate the impacts of data embedding on both intra-frame and inter-frame content. We also provide an algorithm for generating a reference image that guides the data embedding. With these algorithms, we achieve better performance against steganalysis. The rest of this paper is organized as follows. We introduce the background of GIF steganography in Section 2. The proposed framework is described in Section 3. Section 4 presents the experimental results and analysis. Section 5 concludes the paper.

2 Preliminaries

Let there be \(K\) frames in an emoji GIF image. Each frame is a color index matrix \({\varvec{I}}\) of size \(M \times N\). The image contains a color palette in which a limited number of colors are represented, e.g., 256 colors for an 8-bit palette.

The pixels are represented by \({\varvec{I}}_{ij}\), where \(i \in \left\{ {1, 2, \ldots , M} \right\}\) and \(j \in \left\{ {1, 2, \ldots , N} \right\}\). The value of each pixel is an index \(l\) into the palette \({\varvec{C}}_{l}\), where \(l \in \left\{ {0, 1, 2, \ldots , 255} \right\}\). Accordingly, an RGB value (\(R_{ij} ,G_{ij} ,B_{ij}\)) can be looked up from the index \({\varvec{I}}_{ij}\). Figure 1 illustrates the composition of an emoji GIF image.

Fig. 1 Illustration of the composition of an emoji GIF image

We denote the cover and the stego images as X and Y, respectively. The pixels are represented by \({\varvec{X}}_{ij}\) and \({\varvec{Y}}_{ij}\). After embedding data into any pixel \({\varvec{X}}_{ij}\) in X, we obtain the pixel \({\varvec{Y}}_{ij}\) and the stego Y. The modification is either binary or ternary. In ternary embedding, each pixel in the stego is \({\varvec{Y}}_{ij} \in \left\{ {{\varvec{X}}_{ij} + 1, {\varvec{X}}_{ij} , {\varvec{X}}_{ij} - 1} \right\}\).

To minimize the change of RGB values caused by embedding modifications, the method in [29] proposes a palette sorting algorithm. First, it calculates the sum of squares of the RGB channel values corresponding to the \(l\)-th index in the palette \(C_{l}\):

$$t_{\left( l \right)} = R\left( l \right)^{2} + G\left( l \right)^{2} + B\left( l \right)^{2}$$
(1)

After sorting the obtained values in ascending order, we obtain a sorted palette \(C_{l} ^{\prime}\). Based on the new palette, we can regenerate a new index matrix \({\varvec{I}}_{k} ^{\prime}\), where \(k\) denotes the \(k\)-th frame.
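For clarity, a minimal sketch of this sorting step is given below, assuming the palette is available as a (256, 3) array of RGB values and each frame as an index matrix; the function and variable names are ours, not from [29].

```python
import numpy as np

def sort_palette(palette, frames):
    """Sort palette entries by t(l) = R^2 + G^2 + B^2 and remap all frames."""
    p = palette.astype(np.int64)
    t = (p ** 2).sum(axis=1)                # eq. (1) for every index l
    order = np.argsort(t, kind="stable")    # ascending sort of t(l)
    sorted_palette = palette[order]
    # old index l must be replaced by its new position in the sorted palette
    remap = np.empty(256, dtype=np.uint8)
    remap[order] = np.arange(256, dtype=np.uint8)
    sorted_frames = [remap[f] for f in frames]
    return sorted_palette, sorted_frames
```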

Let the embedding costs of ternary embedding be \(\rho_{ij}^{ + }\), \(\rho_{ij}\), and \(\rho_{ij}^{ - }\), where \(\rho_{ij} = 0\), \(\rho_{ij}^{ + } \in \left( {0, + \infty } \right)\), and \(\rho_{ij}^{ - } \in \left( {0, + \infty } \right)\). The additive distortion function \(D\left( {{\varvec{X}}, {\varvec{Y}}} \right)\) is the sum of the embedding costs over all pixels.

$$D\left( {{\varvec{X}}, {\varvec{Y}}} \right) = \mathop \sum \limits_{i = 1, j = 1}^{i = M,j = N} \rho_{ij} \left( {{\varvec{X}}_{ij} ,{\varvec{Y}}_{ij} } \right)$$
(2)

To embed a secret message into the cover X, the Syndrome Trellis Coding (STC) framework requires a modification probability \(p_{ij}\) for each pixel. According to [32], the modification probability \(p_{ij}\) is obtained from the embedding cost \(\rho_{ij}\) by (3).

$$p_{ij}^{\left( I \right)} = \frac{{e^{{ - \lambda \rho_{ij} (I)}} }}{{\mathop \sum \nolimits_{{I \in \left\{ { + 1, 0, - 1} \right\}}} e^{{ - \lambda \rho_{ij} (I)}} }}$$
(3)

In (3), the sum runs over the modification set; its size \(\left| I \right|\) is 2 for binary embedding and 3 for ternary embedding. Because the embedding costs \(\rho_{ij}\) are known, substituting the resulting \(p_{ij}\) into the payload constraint (4) yields the parameter λ, where \(m\) is the number of secret bits to be embedded by the data hider.

$$H\left( p \right) = - \mathop \sum \limits_{i = 1, j = 1}^{i = M,j = N} p_{ij} \log p_{ij} = m$$
(4)
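As an illustration (not code from the paper), the following sketch solves (3) and (4) for λ by binary search, which is the usual approach in STC-based embedding simulators; `rho_plus` and `rho_minus` are assumed M × N cost maps, and the cost of keeping a pixel unchanged is zero, as stated above.

```python
import numpy as np

def ternary_probs(lam, rho_plus, rho_minus):
    """Per-pixel probabilities of +1, 0, -1 from eq. (3)."""
    e0, ep, em = 1.0, np.exp(-lam * rho_plus), np.exp(-lam * rho_minus)
    z = e0 + ep + em
    return ep / z, e0 / z, em / z

def entropy_bits(*probs):
    """Total entropy H(p) in bits, summed over all pixels, eq. (4)."""
    h = 0.0
    for p in probs:
        q = np.clip(p, 1e-15, 1.0)
        h -= (p * np.log2(q)).sum()
    return h

def solve_lambda(rho_plus, rho_minus, m, iters=60):
    lo, hi = 1e-6, 1e3          # entropy decreases monotonically in lambda
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if entropy_bits(*ternary_probs(mid, rho_plus, rho_minus)) > m:
            lo = mid            # too much entropy: increase lambda
        else:
            hi = mid
    return 0.5 * (lo + hi)
```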

3 Proposed framework

The proposed scheme is depicted in Fig. 2. First, after sorting the palette, we decompose the animated GIF into its frames. For each frame, we retrieve the RGB values of every pixel from the GIF color palette and convert each frame \({{\varvec{I}}}_{k}^{\prime}\) into a color image \({F}_{k}\), where \(k\) is the frame index. Then, we construct a reference frame \({\widehat{F}}_{k}\) for each frame and use it to optimize the embedding costs. We further improve the inter-frame distortion using the previous frame as a reference. After embedding data into each frame, we obtain a stego GIF.

Fig. 2 Overview of the proposed scheme

A. Improved bipolar embedding

Since GIF is a compressed format, a GIF image can be regarded as a 256-color image quantized from a true-color image. Therefore, we can use the original content of the image before GIF compression to improve bipolar embedding. We convert the GIF frames into color images \({\varvec{F}} = \left\{ {{\varvec{F}}_{1} , \ldots ,{\varvec{F}}_{K} } \right\}\) according to the palette, where K is the number of frames. For each frame, every pixel B = (\(R_{ij} ,G_{ij} ,B_{ij}\)) has a corresponding pixel A = \(\left( {\hat{R}_{ij} ,\hat{G}_{ij} ,\hat{B}_{ij} } \right)\) at the same location before GIF compression. Compression is thus equivalent to shifting the RGB value from point A to point B, and vector AB represents the distortion introduced by compression.

When we embed secret messages into \({\varvec{I}}_{ij}\), the pixel index is either increased or decreased by one. In (5), a is the difference vector between A and B in RGB space. We use C = (\(R_{ij}^{ + } , G_{ij}^{ + } , B_{ij}^{ + }\)) or (\(R_{ij}^{ - } , G_{ij}^{ - } , B_{ij}^{ - }\)) to denote the pixel at the same location after embedding. In (6) and (7), the difference vectors caused by the + 1 and − 1 operations are defined as b+ and b, respectively.

$${\varvec{a}} = \left[ {\left( {R_{ij} - \hat{R}_{ij} } \right),\left( {G_{ij} - \hat{G}_{ij} } \right),\left( {B_{ij} - \hat{B}_{ij} } \right)} \right]$$
(5)
$${\varvec{b}}^{ + } { } = \left[ {\left( {R_{ij}^{ + } - R_{ij} } \right),\left( {G_{ij}^{ + } - G_{ij} } \right),\left( {B_{ij}^{ + } - B_{ij} } \right)} \right]$$
(6)
$${\varvec{b}}^{ - } = \left[ {\left( {R_{ij}^{ - } - R_{ij} } \right),\left( {G_{ij}^{ - } - G_{ij} } \right),\left( {B_{ij}^{ - } - B_{ij} } \right)} \right]$$
(7)

Subsequently, we define the modification angle between a and b+ or b as \(\theta^{ + }\) or \(\theta^{ - }\) in (8) and (9). The operator \(\left| \cdot \right|\) denotes the modulus of a vector.

$$\theta^{ + } = \arccos \left( {\frac{{\user2{a } \cdot { }{\varvec{b}}^{ + } }}{{\left| {\varvec{a}} \right| \cdot \left| {{\varvec{b}}^{ + } } \right|}}} \right)$$
(8)
$$\theta^{ - } = \arccos \left( {\frac{{\user2{a } \cdot { }{\varvec{b}}^{ - } }}{{\left| {\varvec{a}} \right| \cdot \left| {{\varvec{b}}^{ - } } \right|}}} \right)$$
(9)

Figure 3 illustrates the cases in which the modification angle is acute or obtuse. Vector AB represents the distortion introduced by compression, and vector BC the distortion caused by data embedding.

Fig. 3 Pixel modification for GIF where the modification angle is: a acute, b obtuse

In Fig. 3a, when the angle \(\theta\) between the two vectors AB and BC is acute, the compression direction agrees with the embedding direction. In other words, the overall error is the embedding error "plus" the compression error. In Fig. 3b, when \(\theta\) is obtuse, the compression and embedding directions oppose each other, so the overall error is the embedding error "minus" the compression error. We denote the overall error AC in Fig. 3 as c+ and c in (10).

$$\left\{ {\begin{array}{*{20}c} {{\varvec{c}}^{ + } = {\varvec{a}} + {\varvec{b}}^{ + } } \\ {{\varvec{c}}^{ - } = {\varvec{a}} + {\varvec{b}}^{ - } } \\ \end{array} } \right.$$
(10)

In summary, when \(\theta^{ + } \in \left( {0, \pi /2} \right)\) or \(\theta^{ - } \in \left( {0, \pi /2} \right)\), \(\left| {{\varvec{c}}^{ + } } \right|\) or \(\left| {{\varvec{c}}^{ - } } \right|\) is larger than every element of \(\left\{ {\left| {\varvec{a}} \right|,\left| {{\varvec{b}}^{ + } } \right|,\left| {{\varvec{b}}^{ - } } \right|} \right\}\). When \(\theta^{ + } \in \left( {\pi /2, \pi } \right)\) or \(\theta^{ - } \in \left( {\pi /2, \pi } \right)\), \(\left| {{\varvec{c}}^{ + } } \right|\) or \(\left| {{\varvec{c}}^{ - } } \right|\) is smaller. Therefore, we prefer pixel modifications whose angle \(\theta^{ + }\) or \(\theta^{ - }\) is obtuse, and restrict data embedding when \(\theta^{ + }\) or \(\theta^{ - }\) is acute.

This strategy reduces the extra error caused by the embedding modification and keeps the stego image closer to the original image, which effectively improves the security of the steganography.
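A small sketch of this angle test, under our own variable naming: A is the reference (pre-compression) RGB value, B the compressed value, and C a candidate modified value; the numeric vectors below are made-up examples.

```python
import numpy as np

def modification_angle(rgb, rgb_ref, rgb_mod):
    """Angle between a = B - A (eq. 5) and b = C - B (eqs. 6-7), per eqs. (8)-(9)."""
    a = np.asarray(rgb, float) - np.asarray(rgb_ref, float)
    b = np.asarray(rgb_mod, float) - np.asarray(rgb, float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return np.pi / 2            # degenerate case: treat as neutral
    cos = np.clip(a.dot(b) / denom, -1.0, 1.0)
    return np.arccos(cos)

# Prefer the direction whose angle is obtuse: the embedding error then
# partially cancels the compression error, so |c| = |a + b| is smaller.
theta_plus = modification_angle((120, 80, 60), (118, 79, 61), (121, 81, 61))
prefer_plus = theta_plus > np.pi / 2
```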

B. Reference construction

According to the above analysis, if the data hider has the original content of the image before GIF compression, better security can be achieved by modifying pixel values toward the original values. However, in most cases the data hider does not have access to the pre-compression content. Therefore, we use an algorithm that constructs a reference image for each frame.

To achieve satisfactory performance, the constructed reference images should be close to the original images before GIF compression. Inspired by [30], we treat image compression as a procedure that adds noise to the original content. To remove this kind of noise, we use the DnCNN model proposed in [31] to construct a reference image. This model has proved useful in many denoising tasks. Unlike existing denoising methods designed for additive white Gaussian noise at a known level, DnCNN can handle Gaussian denoising at unknown noise levels. Moreover, it handles multiple general image restoration tasks, such as Gaussian denoising, single-image super-resolution, and JPEG deblocking.

Denote the luminance of the original image as YOri and that of the GIF-compressed image as YComp. As shown in Fig. 4, the residual image YRes is the difference between the compressed image and the original image in the luminance channel, i.e., YOri = YComp − YRes. In other words, the residual image can be regarded as a kind of image noise. With a residual learning strategy, the residual image can be estimated [31]. DnCNN is trained on the luminance channel because human perception is more sensitive to changes in brightness than to changes in chrominance.

Fig. 4 The generation of the residual image

In Fig. 5, an animated emoji is decomposed into frames, and the GIF-compressed images are converted from RGB to YCbCr space. The DnCNN network is trained to estimate the residual images from the luminance of the color frames. Three colors in the figure represent three types of layers. In the first layer, marked in yellow, 64 filters generate 64 feature maps, followed by a ReLU activation. In layers 2 to (D − 1), marked in blue, 64 filters of size 3 × 3 × 64 are used, with batch normalization inserted between convolution and ReLU; this speeds up training and boosts denoising performance. The orange layer is the last layer, in which filters of size 3 × 3 × 64 reconstruct the output. D is the depth of DnCNN; for image denoising, the number of convolution layers is generally set to 20.

Fig. 5 The architecture of the residual image construction network
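For concreteness, a minimal PyTorch sketch of a DnCNN-style network matching the description above (depth D = 20, 64 feature maps, batch normalization between convolution and ReLU, residual output) might look as follows; this is our reconstruction from [31], not the authors' training code.

```python
import torch.nn as nn

def build_dncnn(depth=20, features=64, channels=1):
    """DnCNN-style residual predictor for a single luminance channel."""
    layers = [nn.Conv2d(channels, features, 3, padding=1),  # first layer
              nn.ReLU(inplace=True)]
    for _ in range(depth - 2):                              # layers 2 .. D-1
        layers += [nn.Conv2d(features, features, 3, padding=1, bias=False),
                   nn.BatchNorm2d(features),                # BN between conv and ReLU
                   nn.ReLU(inplace=True)]
    layers += [nn.Conv2d(features, channels, 3, padding=1)] # last layer
    return nn.Sequential(*layers)                           # predicts Y_Res, not Y_Ori
```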

With the DnCNN model, we can reconstruct an approximately undistorted version of a compressed frame by subtracting the estimated residual from the compressed luminance channel and then converting the image back to the RGB color space.

We denote the \(k\)-th RGB frame of the GIF as \(F_{k}\), which is constructed from \({\varvec{I}}_{k} ^{\prime}\) and \(C_{l} ^{\prime}\). Let the reference frame be \(\hat{F}_{k}\), and let \(Y_{{\hat{F}_{k} }}\) and \(Y_{{F_{k} }}\) be the luminance channels of \(\hat{F}_{k}\) and \(F_{k}\), respectively. The reference luminance \(Y_{{\hat{F}_{k} }}\) is calculated by (11)

$$Y_{{\hat{F}_{k} }} = Y_{{F_{k} }} - {\text{Dn}}\_{\text{CNN}}\left( {Y_{{F_{k} }} } \right)$$
(11)

where \({\text{Dn}}\_{\text{CNN}}\left( \cdot \right)\) denotes the DnCNN denoising network. We concatenate the denoised luminance channel \(Y_{{\hat{F}_{k} }}\) with the original chrominance channels to obtain the denoised image in YCbCr space, and then convert it to RGB to produce the reference image \(\hat{F}_{k}\), which is close to the original image \(\tilde{F}_{k}\) before compression.
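The pipeline of (11) can be sketched as follows, assuming a trained `dncnn` callable that maps a luminance array to the estimated residual; `skimage` is used here only for the color-space conversions.

```python
import numpy as np
from skimage.color import rgb2ycbcr, ycbcr2rgb

def build_reference(frame_rgb, dncnn):
    """Reference frame per eq. (11); frame_rgb is an H x W x 3 float image in [0, 1]."""
    ycbcr = rgb2ycbcr(frame_rgb)
    y = ycbcr[..., 0]
    y_ref = y - dncnn(y)                    # eq. (11): subtract the estimated residual
    # recombine the denoised luminance with the original chrominance channels
    ycbcr_ref = np.dstack([y_ref, ycbcr[..., 1], ycbcr[..., 2]])
    return np.clip(ycbcr2rgb(ycbcr_ref), 0.0, 1.0)
```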

C. Distortion function improvement

Let \(\rho_{ij}^{ + }\) and \(\rho_{ij}^{ - }\) be the embedding costs for the + 1 and − 1 modifications in intra-frame embedding, respectively, where \(i \in \left\{ {1, \ldots ,M} \right\}\) and \(j \in \left\{ {1, \ldots ,N} \right\}\). In many STC-based steganography methods, \(\rho_{ij}^{ + }\) is identical to \(\rho_{ij}^{ - }\). In the proposed method, we improve the embedding cost function according to the RGB values (\(\hat{R}_{ij} ,\hat{G}_{ij} ,\hat{B}_{ij}\)) of the reference image \(\hat{F}_{k}\).

We first initialize the costs \(\rho_{ij}\) for the pixels in each frame using a traditional distortion function such as HILL [5], WOW [6], or UNIWARD [7, 8]. For each pixel, we multiply the original cost \(\rho_{ij}\) by a factor \(\alpha\). There are two cases for the factor when performing the \(\pm 1\) operations, namely \(\alpha^{ + }\) and \(\alpha^{ - }\), defined in (12) and (13). We adjust the distortion function in (14) and (15) by combining the three cases, where wetCost is a very large value, e.g., \(10^{8}\) in our experiments.

$$\alpha^{ + } = \left| {{\varvec{a}} + {\varvec{b}}^{ + } } \right|/\left( {\left| {\varvec{a}} \right| + \left| {{\varvec{b}}^{ + } } \right|} \right)$$
(12)
$$\alpha^{ - } = \left| {{\varvec{a}} + {\varvec{b}}^{ - } } \right|{ }/{ }\left( {\left| {\varvec{a}} \right| + \left| {{\varvec{b}}^{ - } } \right|} \right)$$
(13)
$$\rho_{ij}^{ + } = \left\{ {\begin{array}{ll} {{\text{wetCost }} \quad {\text{if}}\, \theta^{ + } \in \left( {0, \pi /2} \right)} \\ {\rho_{ij} \quad {\text{if}}\, \theta^{ + } = \pi /2} \\ {\alpha^{ + } \cdot \rho_{ij} \quad {\text{if}}\, \theta^{ + } \in \left( {\pi /2, \pi } \right)} \\ \end{array} } \right.$$
(14)
$$\rho_{ij}^{ - } = \left\{ {\begin{array}{ll} {{\text{wetCost }} \quad {\text{if}}\, \theta^{ - } \in \left( {0, \pi /2} \right)} \\ {\rho_{ij} \quad {\text{if}}\, \theta^{ - } = \pi /2} \\ {\alpha^{ - } \cdot \rho_{ij} \quad {\text{if}}\, \theta^{ - } \in \left( {\pi /2, \pi } \right)} \\ \end{array} } \right.$$
(15)
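A vectorized sketch of (12)–(15) under our own naming: `rho` is the initial cost map from HILL/WOW/UNIWARD, and `a` and `b` are the H × W × 3 difference arrays of (5)–(7) (or of (16) for the inter-frame case below); the toy data at the end is only for illustration.

```python
import numpy as np

WET_COST = 1e8

def adjust_costs(rho, a, b):
    """Adjusted cost map for one modification direction, per (14)/(15)."""
    dot = (a * b).sum(axis=-1)
    norm = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    cos = np.divide(dot, norm, out=np.zeros_like(dot), where=norm > 0)
    alpha = (np.linalg.norm(a + b, axis=-1)
             / (np.linalg.norm(a, axis=-1) + np.linalg.norm(b, axis=-1) + 1e-12))
    out = np.where(cos < 0, alpha * rho, rho)   # obtuse angle: scale cost by alpha
    return np.where(cos > 0, WET_COST, out)     # acute angle: forbid the change

rng = np.random.default_rng(0)
rho = rng.random((4, 4)) + 0.1                  # toy initial costs
a = rng.normal(size=(4, 4, 3))                  # toy difference vectors, eq. (5)
b_plus = rng.normal(size=(4, 4, 3))             # toy difference vectors, eq. (6)
rho_plus = adjust_costs(rho, a, b_plus)         # eq. (14); eq. (15) is analogous
```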

On the other hand, data hiding also changes the differences between adjacent frames, so we must consider the impact of inter-frame embedding. For each frame \(F_{k}\), we use the previous frame \(F_{k - 1}\) as a reference. The RGB values (\(R_{ij}^{k - 1} ,G_{ij}^{k - 1} ,B_{ij}^{k - 1}\)) from \(F_{k - 1}\) guide the modification of the current frame. The procedure is the same as for intra-frame embedding except that a is redefined in (16) and the inter-frame embedding costs are denoted \(\dot{\rho }_{ij}^{ + }\) and \(\dot{\rho }_{ij}^{ - }\) in (17) and (18). Unlike a in (5), the vector a redefined in (16) represents the change of the RGB values at the same position between adjacent frames. We then obtain \(\alpha^{ + }\) and \(\alpha^{ - }\) by applying (12) and (13) and update the distortion function as in (17) and (18).

$${\varvec{a}} = \left[ {\left( {R_{ij} - R_{ij}^{k - 1} } \right),\left( {G_{ij} - G_{ij}^{k - 1} } \right),\left( {B_{ij} - B_{ij}^{k - 1} } \right)} \right]$$
(16)
$$\dot{\rho }_{ij}^{ + } = \left\{ {\begin{array}{ll} {{\text{wetCost}}\quad {\text{if}}\, \theta^{ + } \in \left( {0, \pi /2} \right)} \\ {\rho_{ij}\quad {\text{if}}\, \theta^{ + } = \pi /2} \\ {\alpha^{ + } \cdot \rho_{ij}\quad {\text{if}}\, \theta^{ + } \in \left( {\pi /2, \pi } \right)} \\ \end{array} } \right.$$
(17)
$$\dot{\rho }_{ij}^{ - } = \left\{ {\begin{array}{ll} {{\text{wetCost}}\quad {\text{if}}\, \theta^{ - } \in \left( {0,\pi /2} \right)} \\ {\rho_{ij}\quad {\text{if}}\, \theta^{ - } = \pi /2} \\ {\alpha^{ - } \cdot \rho_{ij}\quad {\text{if}}\, \theta^{ - } \in \left( {\pi /2,\pi } \right)} \\ \end{array} } \right.$$
(18)

Finally, we combine the intra-frame and inter-frame costs to obtain the final distortion function in (19).

$$\left\{ {\begin{array}{*{20}c} {\overline{{\rho_{ij}^{ + } }} = \dot{\rho }_{ij}^{ + } \times \rho_{ij}^{ + } } \\ {\overline{{\rho_{ij}^{ - } }} = \dot{\rho }_{ij}^{ - } \times \rho_{ij}^{ - } } \\ \end{array} } \right.$$
(19)
D. Payload allocation

For most animated GIFs, the content differs from frame to frame. To improve the security of each frame, we adaptively allocate different payloads to the frames according to their characteristics. We adopt the algorithm proposed in [32], in which an \(m\)-bit secret message is embedded into \(n\) covers with minimized distortion. The distortion is calculated in (20),

$$D_{{{\text{min}}}} \left( {m,n,\rho } \right) = \mathop \sum \limits_{i = 1,j = 1}^{i = M,j = N} \rho_{ij} p_{ij}$$
(20)

and the optimization problem is defined in (21),

$$\begin{gathered} \mathop {\min }\limits_{p} \; D_{{{\text{min}}}} \left( {m,n,\rho } \right) \hfill \\ {\text{subject to }} H\left( p \right) = m \hfill \\ \end{gathered}$$
(21)

In this paper, the optimization problem is defined as (22),

$$\begin{gathered} \mathop {\min }\limits_{{\overline{p}}} \; D_{{{\text{min}}}} \left( {m,n,\overline{\rho }} \right) = \mathop \sum \limits_{k = 1}^{n} \overline{\rho }_{k} \overline{p}_{k} \hfill \\ {\text{subject to }} \mathop \sum \limits_{k = 1}^{n} H\left( {\overline{p}_{k} } \right) = m \hfill \\ \end{gathered}$$
(22)

where \(n\) is the total number of selected GIF frames, \(\overline{\rho }_{k}\) is the embedding cost of the \(k\)-th frame, and \(\overline{p}_{k}\) is the modification probability of the \(k\)-th frame. After calculating the distortion function, we substitute the embedding costs and the total payload into the constraint in (22) and obtain the modification probability of each frame. The embedding payload of the \(k\)-th frame is then given by (23)

$$m_{k} = H\left( {\overline{p}_{k} } \right)$$
(23)
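A sketch of this allocation, under the simplifying assumption of binary embedding with symmetric costs: a single λ is searched so that the total entropy across all frames equals m (the constraint in (22)), and each frame's payload m_k then follows from (23).

```python
import numpy as np

def binary_entropy_bits(p):
    q = np.clip(p, 1e-15, 1 - 1e-15)
    return -(q * np.log2(q) + (1 - q) * np.log2(1 - q)).sum()

def allocate_payload(frame_costs, m, iters=60):
    """frame_costs: list of per-frame cost maps; m: total payload in bits."""
    def total_entropy(lam):
        probs = [np.exp(-lam * rho) / (1.0 + np.exp(-lam * rho))
                 for rho in frame_costs]
        return sum(binary_entropy_bits(p) for p in probs), probs
    lo, hi = 1e-6, 1e3
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        h, _ = total_entropy(mid)
        if h > m:
            lo = mid                     # entropy too high: raise lambda
        else:
            hi = mid
    _, probs = total_entropy(0.5 * (lo + hi))
    return [binary_entropy_bits(p) for p in probs]   # m_k per frame, eq. (23)
```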

4 Experimental results

To verify the proposed framework, we conducted experiments on the emoji GIF dataset provided by [29], which contains 560 animated GIFs. Several examples are shown in Table 1. These GIF images use an 8-bit palette, i.e., each image contains up to 256 colors.

Table 1 Several examples of the dataset

We use binary pseudo-random sequences as the hidden data, i.e., zeros and ones occur with equal probability. We use the popular HILL, UNIWARD, and WOW as the initial distortion functions, and generate the reference images with DnCNN. We name the proposed steganography methods built on HILL, UNIWARD, and WOW as PD-HILL, PD-UNIWARD, and PD-WOW, respectively.

The embedding is performed with the STC framework. The amount of secret data embedded in each frame is set to 600, 700, 800, 900, 1000, and 1100 bits, respectively. We also use payloads of 0.05 bpp, 0.1 bpp, 0.15 bpp, 0.2 bpp, and 0.25 bpp for further comparison.

For steganalysis, we use the ensemble classifier with the SPAM and SRMQ1 feature sets. Half of the cover-stego pairs are used for training and the other half for testing. The minimal total error \(P_{{\text{E}}}\) is used as the criterion to evaluate steganographic security. In (24), \(P_{FA}\) is the false-alarm rate and \(P_{MD}\) is the missed-detection rate. The average \(P_{{\text{E}}}\) over 10 random splits is reported [17].

$$P_{{\text{E}}} = \mathop {\min }\limits_{{P_{FA} }} \left( {\frac{{P_{FA} + P_{MD} }}{2}} \right)$$
(24)
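As an illustration of (24), a minimal computation of \(P_{{\text{E}}}\) from classifier output scores might look like this; `cover_scores` and `stego_scores` are assumed 1-D arrays where larger scores indicate "stego".

```python
import numpy as np

def minimal_total_error(cover_scores, stego_scores):
    """Sweep a decision threshold and return P_E per eq. (24)."""
    thresholds = np.concatenate(
        [np.unique(np.concatenate([cover_scores, stego_scores])), [np.inf]])
    best = 1.0
    for t in thresholds:
        p_fa = (cover_scores >= t).mean()    # covers flagged as stego
        p_md = (stego_scores < t).mean()     # stegos missed
        best = min(best, 0.5 * (p_fa + p_md))
    return best
```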

When testing the security of intra-frame steganography, we convert every frame of the GIF into a color image and then transform it into a grayscale image; SPAM or SRMQ1 features are then extracted. When testing the security of inter-frame steganography, we calculate the difference between adjacent frames and use it for the subsequent steganalysis.

Table 2 shows an embedding test on an emoji image with different payloads and algorithms. The original HILL method is not effective for embedding a large payload in a GIF because of the simple textures: obvious salt-and-pepper noise appears in the smooth areas. When embedding with the method in [29], salt-and-pepper noise appears along the edges of the image. With our method, there is no obvious noise in either the edge or the texture areas.

Table 2 Embedding test for an animated GIF under different payloads and algorithms

To show the effectiveness of the proposed framework, we use the same experimental settings as [29]. The proposed PD-HILL, PD-WOW, and PD-UNIWARD are used to embed the same amount of message into the same dataset. Table 3 shows the testing errors of PD-HILL, [29], and HILL against SPAM and SRMQ1. The results show that the proposed method provides better visual quality as well as better security.

Table 3 Testing errors of the PD-HILL, [29] and HILL against SPAM and SRMQ1 under low capacity

We further apply larger embedding payloads, i.e., 0.05 bpp to 0.25 bpp. Many GIF images cannot accommodate such large payloads with HILL, so we only compare our method with [29]. In Fig. 6, we use different initial distortion functions. The results show that the proposed method outperforms [29] in most cases.

Fig. 6 Comparisons using a HILL, b WOW, and c UNIWARD

Finally, we conduct the inter-frame security experiments in comparison with [29]. We use the ensemble classifier to calculate the \({P}_{\mathrm{E}}\) between frames. Table 4 shows the inter-frame testing errors of PD-HILL and [29]. The proposed method achieves better performance.

Table 4 Inter-frame testing errors of PD-HILL and [29] against SPAM and SRMQ1

5 Conclusions

In this paper, we propose an improved steganography method for animated emoji using self-reference. We first construct reference images with the DnCNN network. Guided by the reference images, we adaptively modify the pixels according to the modification angles of the RGB difference vectors induced by the + 1 and − 1 operations. We further use the previous frame as a reference to improve the security of steganography between frames. Several typical distortion functions such as HILL are used, and the embedding is performed by the STC framework. Experimental results show that the proposed method outperforms state-of-the-art steganography methods for animated emoji images in security.