
1 Introduction

Steganography is the process of concealing secret information within an ordinary image. The image used for hiding the secret information is called the cover image, and the image after embedding a secret message is known as the stego image. Steganography can be categorized into two types: (1) spatial domain and (2) transform domain steganography. Spatial domain steganography hides the secret by modifying the pixels of the cover image, whereas transform domain steganography conceals the secret message within the transform (e.g., DCT) coefficients [1] of the image. Traditional steganography schemes such as LSB replacement [20] hide the secret information in the least significant bits (LSB) of the image pixels. Modern steganography schemes such as HUGO [15], WOW [5], and S-UNIWARD [6] hide the secret message by minimizing a heuristically defined distortion function. The distortion function assigns a high cost to embedding in the smooth regions of the image and a low cost to the noisy (textured) areas.

Steganalysis is the process of detecting the trace of a hidden message in a given image. Steganalysis can be broadly divided into two types: (1) blind and (2) targeted steganalysis. Blind steganalysis detects the embedding in the image without knowing the steganographic algorithm used for embedding, whereas targeted steganalysis utilizes the knowledge of the steganographic scheme used for embedding. Steganalysis methods work in two stages. In the first stage, features are extracted using some tools, and in the second stage, classification is done based on the extracted features. Recent distortion-function-based steganographic schemes are more likely to distribute the secret message in the noisy or highly textured areas of the image than in the flat areas. Therefore, steganalysis schemes assume that the steganographic noise lies in the high-frequency components of the source image and strive to capture these features for steganalysis.

2 Related Works

Many steganalysis schemes have been reported in the literature. These can be broadly categorized into two types: (1) handcrafted feature based and (2) deep feature based steganalysis.

The conventional handcrafted feature based methods use fixed handcrafted filters to extract steganalytic features, which are then used for classification. Pevny et al. introduced the Subtractive Pixel Adjacency Matrix (SPAM) [14], which exploits the fact that steganographic noise alters the dependencies between neighboring pixels of an image. These dependencies are captured using a higher-order Markov chain [19], and the transition probability matrix of the Markov chain is used as a feature set to train an SVM classifier [4] for steganalysis. The Spatial Rich Model (SRM) [3], proposed by Fridrich and Kodovsky, uses several linear and non-linear filters to compute noise residuals, followed by 106 different submodels that capture diverse kinds of relationships between neighboring pixels of the noise residual. The submodels are used to train an Ensemble Classifier (EC) [10] for steganalytic classification. SRM [3] showed considerable improvement over SPAM [14] in detecting the trace of steganographic embedding in images. The performance of these classifiers depends on the quality of the features supplied to them. Handcrafted feature based schemes such as SRM [3] and SPAM [14], which rely on several fixed handcrafted filters, may therefore be suboptimal in extracting all the precise steganalytic features.
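
For illustration, a minimal NumPy sketch of SPAM-style feature extraction is given below. It is restricted to a first-order chain along a single (left-to-right) direction; the function name and the truncation threshold T are assumptions for illustration, not the exact construction of [14].

```python
import numpy as np

def spam_features_horizontal(image: np.ndarray, T: int = 3) -> np.ndarray:
    """Return the flattened (2T+1)x(2T+1) transition probability matrix of the
    truncated horizontal difference array (first-order, left-to-right only)."""
    x = image.astype(np.int64)
    d = x[:, :-1] - x[:, 1:]            # horizontal differences between neighbors
    d = np.clip(d, -T, T)               # truncate to [-T, T]
    prev = d[:, :-1].ravel() + T        # shift values into {0, ..., 2T}
    nxt = d[:, 1:].ravel() + T
    counts = np.zeros((2 * T + 1, 2 * T + 1), dtype=np.float64)
    np.add.at(counts, (prev, nxt), 1.0)
    # Normalize each row into conditional transition probabilities.
    row_sums = counts.sum(axis=1, keepdims=True)
    probs = np.divide(counts, row_sums, out=np.zeros_like(counts),
                      where=row_sums > 0)
    return probs.ravel()
```

The full SPAM feature set of [14] additionally uses a second-order chain and all eight directions; the sketch only conveys the basic transition-probability idea.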

Convolutional Neural Networks (CNNs) are known to be excellent automatic feature extractors, which mitigates the problems of handcrafted feature extraction. Recently, many CNN based steganalysis works have been reported in the literature; some of them are as follows. Qian et al. proposed GNCNN [16], a CNN based model for steganalysis. GNCNN [16] comprises a fixed preprocessing layer and five convolution layers for feature extraction, followed by three fully connected layers for classification. GNCNN uses a Gaussian activation to capture stego and cover signals more precisely. The preprocessing layer has a fixed high-pass filter, which exposes the stego noise and suppresses the image content. GNCNN reported results comparable to SRM [3] on S-UNIWARD [6], HUGO [15], and WOW [5]. Xu et al. proposed XuNet [21], which comprises a preprocessing layer with a high-pass filter and five groups of layers for feature extraction, followed by a fully connected network for classification. Each group consists of a convolution layer followed by average pooling and Batch Normalization (BN) [8]. The first group uses an ABS layer to retain all values of the noise residual (negative as well as positive), which might otherwise be discarded by some activation functions, followed by a convolution layer. The authors claimed considerable performance gains over SRM with EC when detecting HILL [11] and S-UNIWARD [6]. Ye et al. [22] proposed a CNN based framework which initializes the first layer of the model with the filters of SRM [3] to better capture the noise residual. They also introduced an activation function named the truncated linear unit to capture noise residuals with low SNR. The authors reported better performance than SRM [3] for WOW [5], S-UNIWARD [6], and HILL [11] embedding. Tian and Li [18] proposed a CNN based steganalysis scheme using transfer learning. Their model uses a Gaussian high-pass filter for preprocessing of the images, followed by the pre-trained Inception-V3 [17] model for steganalytic classification.

It has been observed from the literature that most of the existing CNN based steganalysis schemes: (i) sharply increase the feature space by using a sequence of kernels in subsequent layers; (ii) use fixed-size kernels (with little variation), which may not be expressive enough to learn the stego features, since the stego signal is weak and sparse in nature; a kernel with a small spatial dimension may fail to learn these features, while a kernel with a large spatial dimension may lead to overfitting; (iii) use fully connected layers at the end for classification. The use of fully connected layers imposes the constraint that training and testing must be carried out on images with the same spatial dimension. Due to this restriction, images of a different size must be resized before testing. However, resizing may lead to loss of stego signal, conceptually similar to pooling.

In this paper, considering the shortcomings mentioned above, a densely connected convolutional network has been proposed for steganalysis. In contrast to the existing schemes, the proposed scheme makes the following contributions:

  • A densely connected convolutional network without pooling layers is proposed, which progressively captures the steganalytic features at different scales.

  • The fully connected layers are removed, which allows the model to be tested on images of any size, regardless of the size of the images used for training.

The proposed scheme is trained and tested on the BOSSBase 1.0 [2] dataset, and its steganalytic performance is compared with SRM [3] and SPAM [14] against S-UNIWARD [6], HUGO [15], WOW [5], and HILL [11]. The performance of the proposed scheme is also compared with the recently proposed scheme of Tian and Li [18] against WOW [5] and S-UNIWARD [6].

Fig. 1. The proposed model architecture. The architecture of each block is similar; one of the blocks (Block 1) is also shown in a dotted box. Each block consists of \(4\times (Conv \rightarrow BN \rightarrow ReLU)\) with the sizes indicated for each convolution layer.

3 Proposed Work

This section presents the proposed scheme for targeted steganalysis. The proposed model is inspired by DenseNet [7]; its architecture is shown in Fig. 1. The model comprises an image processing layer followed by four densely connected convolution blocks and a sigmoid layer at the end for classification. Since steganalytic classifiers are trained on the noise residual instead of the image content, a fixed high-pass filter (HPF), given in Eq. (1), is used in the image processing layer.

$$\begin{aligned} HPF = \frac{1}{12} \begin{pmatrix} -1 & 2 & -2 & 2 & -1 \\ 2 & -6 & 8 & -6 & 2 \\ -2 & 8 & -12 & 8 & -2 \\ 2 & -6 & 8 & -6 & 2 \\ -1 & 2 & -2 & 2 & -1 \end{pmatrix} \end{aligned}$$
(1)
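
For concreteness, a minimal PyTorch sketch of such a fixed, non-trainable image processing layer is given below; the class and variable names are illustrative and not taken from the authors' implementation.

```python
import torch
import torch.nn as nn

# The 5x5 high-pass filter of Eq. (1), scaled by 1/12.
KV_FILTER = torch.tensor([[-1.,  2.,  -2.,  2., -1.],
                          [ 2., -6.,   8., -6.,  2.],
                          [-2.,  8., -12.,  8., -2.],
                          [ 2., -6.,   8., -6.,  2.],
                          [-1.,  2.,  -2.,  2., -1.]]) / 12.0

class ImageProcessingLayer(nn.Module):
    """Fixed HPF convolution; the kernel is never updated during training."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=5, padding=2, bias=False)
        self.conv.weight.data.copy_(KV_FILTER.view(1, 1, 5, 5))
        self.conv.weight.requires_grad = False   # keep the kernel fixed

    def forward(self, x):            # x: (N, 1, H, W) grayscale images
        return self.conv(x)          # noise residual with the same spatial size
```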

The kernel of the image processing layer is kept fixed and is not updated during training. The noise residual extracted by the image processing layer is used as input to the subsequent dense blocks. The densely connected blocks are used to mitigate the vanishing-gradient problem and the loss of stego features. Each block is connected to all its subsequent blocks; consequently, every block receives the feature maps of all its preceding blocks. Each block comprises five convolutional layers. The details of the layers used in each block are given in Table 1. All the blocks have the same configuration except for the last block (Block 4), where the output feature size is \(1\times 512\times 512\). The convolutional layers in each block are followed by Batch Normalization [8] for faster convergence and the ReLU [12] activation. Pooling layers have not been used, since pooling may result in loss of the stego noise. The number of convolutional filters progressively increases as 4, 8, 16, 32, and 64, and the kernel size also increases gradually from \(1\times 1\) to \(5\times 5\), so that each block slowly increases the scope of the convolution operator. The different-sized kernels help to learn features at different scales, thereby avoiding the loss of stego signal and capturing more prominent features. The output of the densely connected blocks is a negative residual map, which is added pixel-wise to the noise residual extracted by the image processing layer in order to boost the noise components. The resulting output is fed to the classification layer (sigmoid layer), which determines whether the input image is a stego or a cover image using the mean sigmoid over all pixels. The whole framework is trained by minimizing the cross-entropy loss given in Eq. (2).

$$\begin{aligned} L = - \sum _{\forall x} p(x)\,\log (q(x)) \end{aligned}$$
(2)

where p(x) and q(x) denote the true and estimated distributions, respectively, over a discrete variable x.
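
To make the description above concrete, the following PyTorch sketch (reusing the ImageProcessingLayer sketched after Eq. (1)) illustrates one possible realisation of the dense blocks, the residual boosting, and the mean-sigmoid classification. The per-layer filter counts of the last block, the use of odd kernel sizes (so that padding preserves the spatial dimensions), and the exact form of the mean sigmoid are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

def make_block(in_channels: int, last_filters: int = 64) -> nn.Sequential:
    """Five Conv -> BN -> ReLU layers; filters grow as 4, 8, 16, 32, last_filters."""
    filters = [4, 8, 16, 32, last_filters]
    kernels = [1, 3, 3, 5, 5]          # odd sizes so padding keeps H x W unchanged
    layers = []
    for f, k in zip(filters, kernels):
        layers += [nn.Conv2d(in_channels, f, kernel_size=k, padding=k // 2),
                   nn.BatchNorm2d(f),
                   nn.ReLU(inplace=True)]
        in_channels = f
    return nn.Sequential(*layers)

class DenseSteganalysisNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.hpf = ImageProcessingLayer()          # fixed HPF of Eq. (1)
        self.blocks = nn.ModuleList()
        in_ch = 1                                  # channels of the noise residual
        for b in range(4):
            out_ch = 1 if b == 3 else 64           # Block 4 emits a 1-channel map
            self.blocks.append(make_block(in_ch, out_ch))
            in_ch += out_ch                        # dense connectivity: inputs grow

    def forward(self, x):
        residual = self.hpf(x)
        features = [residual]
        for block in self.blocks:                  # each block sees all earlier maps
            features.append(block(torch.cat(features, dim=1)))
        boosted = residual + features[-1]          # pixel-wise addition of the
                                                   # (negative) residual map
        # Per-pixel sigmoid averaged over all pixels -> probability of "stego".
        return torch.sigmoid(boosted).mean(dim=(1, 2, 3))
```

With the mean-sigmoid output interpreted as the probability of the stego class, the cross-entropy of Eq. (2) reduces to binary cross-entropy over the cover/stego labels.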

Table 1. Details of the layers in each dense block

4 Implementation Details and Results

4.1 Experimental Setup

The experiments are carried out on the BOSSBase v1.0 dataset [2], which consists of 10,000 cover images of size \(512 \times 512\). The steganographic embedding algorithms S-UNIWARD [6], HUGO [15], WOW [5], and HILL [11] are used to obtain the stego images. The 10,000 cover-stego pairs are divided into 5,000 pairs for training, 1,000 for validation, and 4,000 for testing. For comparison with the proposed model, SRM [3] and SPAM [14] are implemented along with Ensemble Classifier v2.0 [10], using the same split (5,000 pairs for training and 5,000 pairs for testing) as for the proposed model. The proposed model is trained using PyTorch [13] on a standard workstation with an NVIDIA Quadro M-4000 GPU (8 GB) for 90 epochs. The learning rate is initially set to 0.001 and decays by a factor of 10 every 30 epochs. The batch size is empirically kept at 8 (4 cover and 4 stego). The Adam optimizer [9] is used to optimize the network parameters during training.
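
A minimal sketch of this training configuration is shown below; `train_loader` is a placeholder for a data loader yielding batches of 4 cover and 4 stego images, and the loss uses the two-class (binary cross-entropy) form of Eq. (2).

```python
import torch
import torch.nn as nn

model = DenseSteganalysisNet().cuda()      # sketch from Sect. 3
criterion = nn.BCELoss()                   # Eq. (2) for the two-class case
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    model.train()
    for images, labels in train_loader:    # 4 cover + 4 stego images per batch
        optimizer.zero_grad()
        probs = model(images.cuda())
        loss = criterion(probs, labels.float().cuda())
        loss.backward()
        optimizer.step()
    scheduler.step()                       # learning rate decays by 10x every 30 epochs
```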

Table 2. Steganalytic classification accuracy (in %) of the proposed scheme compared to SRM [3] with the Ensemble Classifier [10] and SPAM [14] with the Ensemble Classifier, against S-UNIWARD [6], HUGO [15], WOW [5], and HILL [11].
Table 3. Comparison of the proposed scheme with Tian and Li [18] in terms of steganalytic classification accuracy (in %) against WOW [5] and S-UNIWARD [6].

4.2 Results

The quantitative results of the proposed model, compared to SRM with EC [3] and SPAM [14] with EC [10], are given in Table 2 for the S-UNIWARD [6], HUGO [15], WOW [5], and HILL [11] steganographic schemes at different embedding rates. The results are measured in terms of percentage (%) classification accuracy. The best result is shown in red, and the second best in blue. A series of graphs is also given in Fig. 2 for a visual presentation, where the proposed scheme is shown in red, SRM with EC [3] in green, and SPAM with EC [14] in blue. The results show that the proposed scheme outperforms SRM [3] as well as SPAM [14] for most of the steganographic algorithms. The steganalytic performance of the proposed scheme is also compared with the recent work by Tian and Li [18], which uses the same experimental setup as the proposed scheme. The comparison is done against WOW [5] and S-UNIWARD [6] at embedding rates of {0.1, 0.3, 0.4} bits per pixel (bpp). The results are given in Table 3; the best result is shown in red, and the next best in blue. The proposed scheme has comparable performance against WOW [5] at 0.1 bpp, and for the rest of the steganographic embeddings and payloads, the proposed scheme clearly outperforms Tian and Li [18].

Fig. 2. Steganalytic performance comparison of the proposed scheme (red) with SRM with EC (green) and SPAM with EC (blue) against: (a) S-UNIWARD, (b) HUGO, (c) WOW, and (d) HILL steganography at embedding rates of {0.1, 0.2, 0.3, 0.4} bpp (Color figure online)

5 Conclusion

In this paper, a densely connected convolutional network based steganalysis scheme has been presented. The proposed model captures complex dependencies that are more appropriate for steganalysis, and the learned features avoid the loss of stego signals. The proposed model has no fully connected layer, which adds the advantage that the model can be tested on images of any size, unlike models with fully connected layers, where the image sizes used for training and testing must be the same. The steganalytic performance of the proposed scheme has been compared with SRM and SPAM with the Ensemble Classifier, and with a recent scheme by Tian and Li, against different steganographic algorithms at different embedding rates. The proposed model outperforms the existing schemes by a considerable margin.