1 Introduction

The development of wireless communications has brought countless advantages to our daily lives, although it has also introduced new risks. The issue at the heart of modern communications is the absence of high-level security. Hence, cryptographic schemes have been applied to protect information from unauthorized access or modification, although cryptography cannot fulfill all security expectations. Transmitting seemingly meaningless content through a communication channel reveals that a secret communication is taking place, whereas we sometimes aim to hide the very existence of confidential information. In these cases, steganography approaches are employed to conceal the communication within innocent-looking media (Footnote 1).

On the other hand, steganalysis approaches have been developed to detect the existence of confidential information in suspicious media. These algorithms receive the suspicious media as input and classify it as either clear media (without a secret message) or dirty media (containing confidential information). The steganalyzer may be assumed to know the exact steganography algorithm applied for hiding information, to have partial information about it, or to be completely ignorant of it. Based on this available information, steganalysis approaches fall into two main categories: specific (targeted) and blind (universal) steganalysis. Specific steganalysis approaches are designed to detect a particular steganography method, while blind steganalysis approaches are developed to detect a group of steganography algorithms without any detailed knowledge of the embedding strategy. Besides, quantitative steganalysis refers to the eavesdropper’s efforts to estimate the embedding rate or, equivalently, the length of the confidential message embedded in a host medium [25, 35]. For that to be possible, full information about the steganography algorithm is required. Since obtaining detailed information about the steganography scheme is a somewhat optimistic assumption, blind steganalysis approaches are of great importance.

Among all steganography hosts, including images, audio, video, and network protocol packets, video is regarded as a suitable host for embedding high-volume secret messages because of its high capacity and information redundancy. Nowadays, in order to decrease the cost of transmission and the required storage space, all types of digital media are compressed. Lossy compression algorithms lead to some non-deterministic components in the output media. These components are ideal covers for confidential information. Thus, steganography is often performed during compression. By using more appropriate motion vectors (MVs) for manipulation, better altering methods, and more suitable video compression standards, MV-based steganography approaches have seemingly outpaced MV-based steganalysis methods.

Video coding entails particular statistics in the motion vectors:

  a) The MVs belonging to the neighboring blocks in a coded frame are highly correlated. This correlation exists because (i) neighboring blocks are likely to belong to the same rigid object or static background (for more explanation, please refer to Fig. 2 in [20]), and (ii) the cost function employed during motion estimation encourages the MVs to be close together.

  b) The lossy coding stage during motion compensation can modify the statistics of video and shift the MVs from locally optimal to non-optimal. A majority of MVs, nevertheless, remain locally optimal from the receiver’s point of view.

Exploiting the aforementioned facts, we propose a spatio-temporal steganalysis feature extraction method to take full advantage of the clues that MV-based video steganography leaves on the statistics of a video. The rest of this paper is organized as follows. In Section 2, we position our work in the literature by reviewing the related work on video steganography and steganalysis. Section 3 reviews the basic concepts of motion estimation and compensation during video encoding. The proposed method is then described in Section 4, which includes the following contributions:

  1) We propose a novel MV-based steganalysis approach that takes advantage of complementary spatio-temporal features to capture the clues that MV-based steganography leaves on video statistics.

  2) The comparisons using four targeted MV-based steganography approaches reveal that the proposed approach outperforms its four MV-based steganalysis rivals. Indeed, our method yields a considerable improvement in the reliability of MV-based steganalysis.

  3) We have evaluated the effect of different compression settings, namely the motion estimation algorithm and the quantization parameter, on detection reliability. The experimental results confirm the stability of our features under various settings. Moreover, the experiments show that the features remain effective even at very low embedding rates.

  4) The proposed approach can detect the state-of-the-art MV-based steganography methods to a great extent blindly, i.e., without requiring any side information about the steganography approach, embedding rate, or motion estimation algorithm. The proposed method exploits the joint spatial-temporal features of MVs to reach better performance.

  5) Unlike the competing steganalysis approaches, our features can be extracted even in the case of variable-block-size motion estimation. Besides, the proposed feature extraction approach is compatible with all existing video compression standards.

In Section 5, the experimental settings are explained and the experimental results are illustrated to confirm the superiority of our proposed method in both laboratory and real-world conditions. Since the H.264/AVC algorithm is still one of the most efficient compression algorithms concerning compression efficiency, coding speed, and prevalence, without loss of generality, the H.264/AVC baseline compression standard is applied in experiments. Finally, the conclusion is presented in Section 6.

2 Related work on video steganography and steganalysis

Video steganography methods can be divided into two main categories: inter-frame and intra-frame steganography methods. Intra-frame methods manipulate each video frame individually, regardless of the dependencies among frames [12, 22, 23, 31, 36, 56]. Except for capacity, intra-frame methods perform no better than image steganography methods with respect to the usual steganography criteria. These methods can also be revealed by image steganalysis attacks. In contrast, inter-frame video steganography methods aim to take advantage of the temporal correlations among frames. These schemes include manipulating DCT coefficients [24, 28, 33, 62], embedding in quantization parameters [41, 50] or variable length codes [30], and changing inter-prediction modes [26, 60] or motion vectors [1, 4, 5, 7, 13, 34, 55, 57, 58, 61].

A steganography algorithm is reliable as long as it remains undetectable against all existing steganalysis attacks. Accordingly, security is the main criterion of steganography. Since steganalysis attacks are accessible to the transmitter, steganography algorithms can be tested against them to determine whether they are trustworthy enough to employ for embedding. Several measures exist to quantify how secure a steganography algorithm is, such as detection accuracy, the ROC (Receiver Operating Characteristic) curve, and the AUC (Area Under the Curve).

Motion vectors seem to be the best element of video coding to be employed to hide information for several reasons: First, MV-based video steganography leads to indirect and complicated changes in inter-frame and intra-frame statistics of the video. Besides, experiments have shown that much lower similarities exist among neighboring MVs in comparison with neighboring pixels (for more information, the reader is requested to refer to Fig. 1 and Fig. 2 in [43]). As a result, the detection complexity of MV-based steganography is higher than that of other methods, and MV alteration is the most robust strategy against steganalysis attacks [2]. Besides, due to the motion compensation step, MV manipulations do not cause perceptible degradation in the visual quality of the output video. Moreover, video steganalysis methods that model the embedding procedure as an additive noise cannot detect the presence of the message in MVs [7]. Hence, MV altering has been the most preferred video steganography strategy.

MV-based video steganography methods deal with two fundamental problems: (i) choosing MVs that are as undetectable as possible after modification, and (ii) designing a modification algorithm that leads to the fewest changes in the statistics of the output video. Accordingly, MV-based steganography approaches have gone through three phases of progression [58]. In the first stage, MVs with large prediction error or magnitude were supposed to be the best cloak for confidential information, and the message was embedded in the magnitude or the phase of MVs [1, 13, 55, 61]. Because they selected unsuitable MVs for modification and applied improper embedding algorithms, these methods failed to preserve the statistical characteristics of the original video. In other words, information embedded by these methods is easily detectable by early generations of MV-based steganalysis approaches [6, 42].

For a particular embedding algorithm, more modifications raise the detection probability. Hence, in the second stage, Syndrome-Trellis Codes (STC) [4, 5, 15], Wet Paper Codes (WPC) [4, 7, 18, 19], and BCH codes [32] were introduced and applied to improve the embedding efficiency (number of embedded bits per modification [11]); this results not only in higher security, but also in improved imperceptibility. Based on the idea that MV manipulations shift the MVs from locally optimal to non-optimal, “Reversion Based features” [6] and “AoSO features” [48] have been introduced. Besides, the authors of [51] have proposed a high-dimensional feature set considering the correlations between each macro-block and its neighbors. In order to provide a higher level of security against steganalysis attacks, Cao et al. [5] proposed to select the most uncertain MVs for embedding. To provide robust steganalysis features against the second-generation steganography methods, Yao et al. [57] suggested a cost function based on the relationships between the MV of each macroblock and its neighbors.

Due to information loss in the motion compensation phase, some altered MVs are locally optimal after reconstruction at the receiver’s side. The methods which take advantage of this fact have formed the third stage of development in MV-based steganography [4, 20, 58]. To detect more subtle changes in the statistics of MVs after embedding, Zhang et al. [59] have proposed “Near-Perfect Estimation for Local Optimality features”, which exploit the local optimality of motion vectors according to the Lagrangian multiplier applied during compression.

Taking all MV-based steganalysis methods into consideration, [6, 48, 59] have proved to provide the strongest MV-based steganalysis features ([39] has recently suggested an entropy-based feature set, the results of which are fairly similar to those of [59]). However, even these approaches cannot detect the currently best steganography methods (e.g., [4, 7, 20, 58]). In the following, the two state-of-the-art methods that inspired the proposed method are described in detail.

2.1 Near-perfect steganalytic features

Based on the assumption that an overwhelming majority of motion vectors are locally optimal w.r.t. the Lagrangian multiplier from the receiver’s perspective, Zhang et al. [59] have proposed a 36-D steganalysis feature set. This feature set, called NP estimation features, is extracted using each decompressed MV, its eight neighbors, and their corresponding SAD-based (Sum of Absolute Differences) and SATD-based (Sum of Absolute Transformed Differences) Lagrangian costs. NP estimation features consist of four types of features, each type containing nine dimensions, as follows:

  • Feature Set 1: The jth feature of type 1 is defined as the probability that the SAD-based Lagrangian cost of jth MV position is minimum.

    $$ \begin{array}{@{}rcl@{}} f_{1}^{SAD}({j})&=&\frac{1}{N}\sum\limits_{{k}=1}^{N}{\mu({k},{j})}\\ {j}&=&(1,2,...,9) \end{array} $$
    (1)
    $$ \mu({k,j}) = \delta(\operatornamewithlimits{arg\ min}_{mv}[J_{{b_{k}}, MV}^{SAD}],{mv_{j}(b_{k})}) $$
    (2)

    In (1), N is the number of blocks in a GOP (Group Of Pictures) including M P-frames. In (2), \(J_{{b_{k}}, MV}^{SAD}=\{J_{{b_{k}}, mv_{0}}^{SAD},J_{{b_{k}}, mv_{1}}^{SAD}, ..., J_{{b_{k}}, mv_{8}}^{SAD}\}\). Besides, δ(a, b) (for arbitrary values of a and b) is equal to 1 if a = b, and equal to 0 if a ≠ b. As shown in Fig. 5, \(mv_{0}(b_{k})\) refers to the decoded motion vector for the block \(b_{k}\), and \(mv_{1}(b_{k}),\ldots,mv_{8}(b_{k})\) are the eight closest neighboring MVs to the decoded MV. In this method, the closest distance (r in Fig. 5) is set to one.

  • Feature Set 2: The jth feature of type 2 is defined as the exponentially magnified SAD between the cost of jth position and the minimum Lagrangian cost.

    $$ f_{2}^{SAD}({j})=\frac{1}{Z}\sum\limits_{{k}=1}^{N}{\exp\left\{\left|\frac{J_{{b_{k}}, MV}^{SAD}(i)-\min(J_{{b_{k}}, MV}^{SAD})}{J_{{b_{k}}, MV}^{SAD}(j)}\right|\right\}}\cdot\mu{(k,j)}, \quad {j}=(1,2,...,9) $$
    (3)
    $$ Z=\sum\limits_{{j}=1}^{9}\sum\limits_{{k}=1}^{N}{exp\left\{\left|\frac{J_{{b_{k}}, MV}^{SAD}(i)-min(J_{{b_{k}}, MV}^{SAD})}{J_{{b_{k}}, MV}^{SAD}(j)}\right|\right\}}.\mu{(k,j)} $$
    (4)

Feature sets 3 (\(f_{1}^{SATD}\)) and 4 (\(f_{2}^{SATD}\)) are similar to feature sets 1 and 2 respectively, with the only difference that these features are obtained by applying SATD instead of SAD.
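The type-1 features in (1) and (2) can be illustrated with a short sketch. It assumes that the nine Lagrangian costs of each block have already been computed and are passed in as plain lists; the function name `np_type1_features` and the tie-breaking rule (the first minimum wins) are illustrative assumptions and are not taken from [59].

```python
def np_type1_features(cost_lists):
    """Estimate, for each of the 9 MV positions, the probability that its
    Lagrangian cost is the minimum among the 9 candidates of a block.

    cost_lists: one list of 9 costs per block (decoded MV and 8 neighbors).
    Assumption: on ties, the first minimum in list order is counted.
    """
    n_blocks = len(cost_lists)
    counts = [0] * 9
    for costs in cost_lists:
        if len(costs) != 9:
            raise ValueError("each block needs exactly 9 candidate costs")
        j_min = min(range(9), key=lambda j: costs[j])
        counts[j_min] += 1
    if n_blocks == 0:
        return [0.0] * 9
    return [c / n_blocks for c in counts]
```

Calling the function once with SAD-based costs and once with SATD-based costs would give the two nine-dimensional type-1 feature sets.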

There are three major drawbacks to this approach. First and foremost, as indicated in Table 1, local optimality of MVs according to the Lagrangian multiplier on the transmitter’s side does not guarantee that they are locally optimal from the receiver’s point of view, although in most video frames the number of locally optimal MVs on the receiver’s side is greater than the number of non-locally-optimal MVs. Second, in this approach, the nearest neighbor of an MV differs from it by one unit, whereas the nearest MV should be adapted to the motion estimation resolution. Third, the correlations of MVs within each frame are not considered in this feature set, although there are significant correlations between nearby MVs.

Table 1 The average percentage of non-locally-optimal motion vectors according to the Lagrangian multiplier from the receiver’s point of view using the H.264/AVC standard for QCIF sequences

2.2 Improved steganalysis features

As depicted in Fig. 1, the feature set in [51] is defined by taking 20 possible combinations of the MV of each macro-block and the MVs of two of its neighbors into account. For each such distribution, 9 × 9 = 81 features are introduced as follows. First, the difference between the central MV and each of its two neighbors is calculated. The difference value is a member of the set {−4, −3, −2, −1, 0, 1, 2, 3, 4}: any value larger than 4 or smaller than −4 is clipped to 4 or −4, respectively. Next, an 81-dimensional feature set is formed using the joint differences of each MV and its two neighbors. Finally, 81 × 20 features are computed by combining the 20 possible distributions. To consider the temporal correlations of MVs as well, two reference frames can be used to add 4 × 20 × 81 more features.
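As a minimal sketch of one such 81-dimensional distribution, the following code builds the normalized joint histogram of the two clipped differences. It assumes that each MV is represented by a single integer component and that the triples (central MV, first neighbor, second neighbor) have already been collected; the function names are illustrative and do not come from [51].

```python
from collections import Counter


def clip(value, limit=4):
    """Clamp an integer difference to the range [-limit, limit]."""
    return max(-limit, min(limit, value))


def joint_difference_histogram(triples, limit=4):
    """Normalized joint histogram of (MV_C - MV_1, MV_C - MV_2).

    triples: iterable of (mv_c, mv_1, mv_2), one integer component each.
    Returns a flat list of (2*limit + 1)**2 values ordered by (d1, d2).
    """
    counts = Counter()
    total = 0
    for mv_c, mv_1, mv_2 in triples:
        d1 = clip(mv_c - mv_1, limit)
        d2 = clip(mv_c - mv_2, limit)
        counts[(d1, d2)] += 1
        total += 1
    return [
        counts[(d1, d2)] / total if total else 0.0
        for d1 in range(-limit, limit + 1)
        for d2 in range(-limit, limit + 1)
    ]
```

Repeating this for every neighbor configuration and concatenating the results would give the 81 × 20 vector described above.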

Fig. 1 Various distributions of one central macro-block and its two neighbors [51]. MVC is the current block, and MV1 and MV2 are its two neighbors

The major drawback of the aforesaid method is its relatively high dimensionality, which may lead to the curse of dimensionality. There are also two other weaknesses in this method. First, it does not support video compression standards with sub-pixel accuracy, so it needs refinement to suit the common compression standards. Second, the features are extracted on the assumption that MVs are computed using fixed-size macro-blocks, whereas in recent standards MVs are computed based on variable-sized blocks to obtain a better compression ratio. Therefore, this algorithm cannot be applied directly to standards with recent motion estimation algorithms.

3 General theory: motion estimation and compensation

Compression algorithms have been developed to achieve faster and cheaper transmission and to reduce the required storage space. These days, the H.264/AVC algorithm is still one of the most common compression standards. In this algorithm, each P-frame is compressed using one reference frame. The P-frame is partitioned into non-overlapping macro-blocks containing 16 × 16 pixels. There are four decision modes for each macro-block: full (16 × 16), vertical (the macro-block is divided into two partitions of size 8 × 16), horizontal (the macro-block is divided into two partitions of size 16 × 8), and quadruple (the macro-block is divided into four partitions of size 8 × 8). For each decision mode, the optimal MV per partition is obtained based on a cost function (8) using a predetermined Lagrangian multiplier (5). Supposing an exemplary partitioned block (\(b_{k}\)), the corresponding optimal MV (\(MV_{b_{k}}\)) is obtained using (9).

$$ \lambda_{ME}=\sqrt{\lambda_{mode}} $$
(5)
$$ \lambda_{mode}=0.85\times2^{(QP-12)/3} $$
(6)
$$ SAD_{b_{k},mv}=\sum\limits_{x=X(b_{k})}^{X(b_{k})+BS_{x}(b_{k})}\ \sum\limits_{y=Y(b_{k})}^{Y(b_{k})+BS_{y}(b_{k})}\left|F^{Org}_{x,y,t}-F^{Rec}_{x+mv_{x},y+mv_{y},t-1}\right| $$
(7)
$$ J_{b_{k},mv}=SAD_{b_{k},mv}+\lambda_{ME}\times R_{b_{k},mv} $$
(8)
$$ MV_{b_{k}}=\underset{mv}{\arg\ min}[J_{b_{k},mv}] $$
(9)

In (8), \(R_{b_{k},mv}\) is the number of bits required to transmit the candidate MV, and QP in (6) is the quantization parameter. Besides, \(SAD_{b_{k},mv}\) in (7) is the sum of absolute differences between the pixels of the current block in the original P-frame (\(F^{Org}\)) and the block pointed to by mv in the reconstructed reference frame (\(F^{Rec}\)). Afterward, the optimal partitioning mode (\(Mode_{Opt}\)) is chosen using (10). If the chosen mode is quadruple, the mode decision algorithm is applied again to each 8 × 8 partition (full (8 × 8), vertical (4 × 8), horizontal (8 × 4), and quadruple (4 × 4)) to select the mode with minimum cost (Fig. 2).

$$ Mode_{Opt}=\underset{mode}{arg\ min}[SSD_{b_{k},MV,mode}+\lambda_{mode}\times cfr_{mode}] $$
(10)
Fig. 2 Left: Partitioning mode evaluation during P-frame encoding. Right: Tree structured motion compensation for H.264/AVC [37]

In (10), \(cfr_{mode}\) is the final output bitrate of the macro-block for the candidate partitioning mode, and \(SSD_{b_{k},MV,mode}\) is the sum of squared differences between the original and the reconstructed (16 × 16) or (8 × 8) block [37].
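The motion estimation step in (5) to (9) can be sketched as follows. This is only an illustration under simplifying assumptions: frames are lists of rows indexed as `frame[y][x]`, candidate MVs are integer-valued (no sub-pixel interpolation), the block covers `bs_x` × `bs_y` pixels, and the rate term is supplied by a hypothetical caller-provided function `rate_bits`. None of these names come from the H.264/AVC reference software.

```python
import math


def lambda_mode(qp):
    """Mode-decision Lagrangian multiplier, as in (6)."""
    return 0.85 * 2 ** ((qp - 12) / 3)


def lambda_me(qp):
    """Motion-estimation Lagrangian multiplier, as in (5)."""
    return math.sqrt(lambda_mode(qp))


def sad(cur, ref, x0, y0, bs_x, bs_y, mv):
    """Sum of absolute differences between the current block and the
    reference block displaced by the integer vector mv = (mv_x, mv_y)."""
    mv_x, mv_y = mv
    return sum(
        abs(cur[y][x] - ref[y + mv_y][x + mv_x])
        for y in range(y0, y0 + bs_y)
        for x in range(x0, x0 + bs_x)
    )


def best_mv(cur, ref, x0, y0, bs_x, bs_y, candidates, rate_bits, qp):
    """Return the candidate MV that minimizes the cost J in (8)."""
    lam = lambda_me(qp)

    def cost(mv):
        return sad(cur, ref, x0, y0, bs_x, bs_y, mv) + lam * rate_bits(mv)

    return min(candidates, key=cost)
```

The caller is responsible for passing only candidates whose displaced block stays inside the reference frame.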

4 Proposed steganalysis method

Overview

Figure 3 illustrates the block diagram of the proposed steganalysis feature extraction method based on the MVs’ spatio-temporal features termed as MVST.

Fig. 3 Block diagram of the proposed method

The proposed features are designed to address the shortcomings of the predecessors mentioned in Section 2. We aim to extract a 54-D steganalysis feature vector for every M consecutive P-frames (each GOP) containing N motion vectors (the variable partition sizes allowed by the baseline profile of H.264/AVC are considered in this study). The steganalysis feature vector consists of a 36-D spatial feature set and an 18-D temporal feature set. Using the following scheme, we extract the spatial and temporal features and finally concatenate them. The spatial features for each motion vector are updated based on its differences from the MVs of the eight neighboring partitions of its corresponding block. The temporal features are updated based on the local optimality conditions of each MV, taking its reconstructed reference frame into account as described in [59].

4.1 Frame decoding

As the first step, the current frame and its reference frame are decoded and reconstructed using the input bitstream.

4.2 Partitioning mode extraction

For each motion vector, the partitioning mode is obtained during the process of decoding. The bitstream of the partitioning mode is the first part of each MV’s bitstream and can be extracted by Golomb decoding (more details on the bitstream of H.264/AVC can be found in [40]). If the extracted number (Mode1) is smaller than three, we set s = 16 and Mode = Mode1. If the extracted number is equal to three, we set s = 8 and apply Golomb decoding to the remaining bitstream to find Mode = Mode2. Subsequently, the partition size is obtained based on (11), in which \(BS_{x}\) and \(BS_{y}\) are the width and the height of the block, respectively (Fig. 2 illustrates the partitioning mode evaluation during P-frame encoding).

$$ \left\{ \begin{array}{lcl} \begin{cases}BSx=s\\BSy=s\end{cases} & \text{if} & Mode=0 \\ \begin{cases}BSx=s/2\\BSy=s\end{cases} & \text{if} & Mode=1 \\ \begin{cases}BSx=s\\BSy=s/2\end{cases} & \text{if} & Mode=2 \end{array}\right. $$
(11)
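The mapping in (11) can be written compactly as below. This is a sketch that assumes the two mode numbers have already been decoded from the bitstream; the function name `partition_size` is illustrative. The 4 × 4 case (Mode2 equal to three) is not covered by (11) and is added here as an assumption based on the quadruple mode described in Section 3.

```python
def partition_size(mode1, mode2=None):
    """Return (BSx, BSy) for a partition, following (11).

    mode1: first decoded mode number (0..3) of the 16x16 macro-block.
    mode2: second decoded mode number (0..3), needed only if mode1 == 3.
    """
    if mode1 in (0, 1, 2):
        s, mode = 16, mode1
    elif mode1 == 3:
        if mode2 is None:
            raise ValueError("mode2 is required when mode1 == 3")
        s, mode = 8, mode2
    else:
        raise ValueError("mode1 must be in 0..3")

    if mode == 0:
        return (s, s)
    if mode == 1:
        return (s // 2, s)
    if mode == 2:
        return (s, s // 2)
    if mode == 3 and s == 8:
        # Assumption: quadruple split of an 8x8 partition gives 4x4 blocks.
        return (4, 4)
    raise ValueError("mode2 must be in 0..3")
```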

4.3 Steganalysis feature updating

4.3.1 Spatial features extraction

Extracting the MVs of neighboring pixels

After a P-frame has been decoded completely, the MV of each pixel in the frame is known. Since variable-sized blocks are allowed in recent video compression standards, the pixels of a neighboring block may have different MV values (Fig. 4, left). Hence, in the proposed method, one position in each neighboring block is selected as the reference pixel. As illustrated in Fig. 4, supposing the position of the pixel in the top-left corner of the central block is (\(i_{0}, j_{0}\)), the neighboring MVs are evaluated as follows (Fig. 4, right):

$$ \begin{array}{@{}rcl@{}} MV_{1} & = & MV(i_{0}-1, j_{0}-1)\\ MV_{2} & = & MV(i_{0}, j_{0}-1)\\ MV_{3} & = & MV(i_{0}+BS_{x}, j_{0}-1)\\ MV_{4} & = & MV(i_{0}-1, j_{0})\\ MV_{5} & = & MV(i_{0}+BS_{x}, j_{0})\\ MV_{6} & = & MV(i_{0}-1, j_{0}+BS_{y})\\ MV_{7} & = & MV(i_{0}, j_{0}+BS_{y})\\ MV_{8} & = & MV(i_{0}+BS_{x}, j_{0}+BS_{y}) \end{array} $$
(12)
Fig. 4 Left: an example of existing pixels with different motion vectors in a neighboring block. Right: eight neighboring pixels of each sub-block, the corresponding MVs of which are exploited to form the spatial features
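The neighbor selection can be sketched in a few lines. The code assumes a dense per-pixel MV field stored as `mv_field[j][i]` (row j, column i) and returns `None` for reference pixels outside the frame; both choices are illustrative assumptions. It also takes MV4 as the left neighbor at (i0 − 1, j0), an assumption based on the pattern of the other seven positions.

```python
def neighbor_positions(i0, j0, bs_x, bs_y):
    """Reference pixel of each of the eight neighbors of the block whose
    top-left pixel is (i0, j0) and whose size is bs_x by bs_y."""
    return {
        1: (i0 - 1, j0 - 1),        # top-left
        2: (i0, j0 - 1),            # top
        3: (i0 + bs_x, j0 - 1),     # top-right
        4: (i0 - 1, j0),            # left (assumed, see lead-in)
        5: (i0 + bs_x, j0),         # right
        6: (i0 - 1, j0 + bs_y),     # bottom-left
        7: (i0, j0 + bs_y),         # bottom
        8: (i0 + bs_x, j0 + bs_y),  # bottom-right
    }


def neighbor_mvs(mv_field, i0, j0, bs_x, bs_y):
    """Look up the MV at each reference pixel; None if outside the frame."""
    height = len(mv_field)
    width = len(mv_field[0]) if height else 0
    result = {}
    for k, (i, j) in neighbor_positions(i0, j0, bs_x, bs_y).items():
        inside = 0 <= i < width and 0 <= j < height
        result[k] = mv_field[j][i] if inside else None
    return result
```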

Calculating the spatial features

The rough idea behind this feature set is inspired by [51]. The difference between each of the eight aforementioned MVs and the MV of the central block (\(MV_{0}\)) is calculated. Afterwards, as indicated in (13), the features related to the horizontal differences between MVs (\(f^{h}(K, D)\)) and the vertical differences between MVs (\(f^{v}(K, D)\)) are computed (ρ ∈ {h, v}). In (13), K ∈ {1, …, 8} and T is the truncation threshold. These features are designed to capture the correlations between motion vectors corresponding to a static background or the same rigid object. The greater the difference between the MV of the current block and that of a neighboring pixel, the less likely the two MVs are to be correlated. Accordingly, we have considered an absolute difference of one (T = 1) as the upper bound of correlation (least correlation value). Therefore, each difference value greater than 1 or smaller than −1 is clipped to 1 or −1, respectively. Also, Scale = 1/R (R is the motion vector resolution, equal to 0.25 in the experiments), and δ{a, b} = 1 if a = b; otherwise, δ{a, b} = 0. With R = 0.25, the clipped differences take nine values in steps of R, which gives a 9-D feature set for each neighbor and each (horizontal or vertical) component. Combining the features of all eight neighbors, we obtain a vector containing 8 × 2 × 9 features.

$$ \begin{array}{@{}rcl@{}} f^{\rho}(K,D+5) &=& \left\{ \begin{array}{lcl} P\left((MV_{0}^{\rho}-MV_{K}^{\rho})\leq D/Scale\right) & \text{if} & D=-T\\ P\left((MV_{0}^{\rho}-MV_{K}^{\rho})=D/Scale\right) & \text{if} & -(T-1)\leq D\leq (T-1)\\ P\left((MV_{0}^{\rho}-MV_{K}^{\rho})\geq D/Scale\right) & \text{if} & D=T \end{array}\right.\\ &=& \left\{ \begin{array}{lcl} \frac{1}{N}\sum\limits_{n=1}^{N}\sum\limits_{i=-\infty}^{D}\delta\left(MV_{0}^{\rho}(n)-MV_{K}^{\rho}(n),\ i/Scale\right) & \text{if} & D=-T\\ \frac{1}{N}\sum\limits_{n=1}^{N}\delta\left(MV_{0}^{\rho}(n)-MV_{K}^{\rho}(n),\ D/Scale\right) & \text{if} & -(T-1)\leq D\leq (T-1)\\ \frac{1}{N}\sum\limits_{n=1}^{N}\sum\limits_{i=D}^{+\infty}\delta\left(MV_{0}^{\rho}(n)-MV_{K}^{\rho}(n),\ i/Scale\right) & \text{if} & D=T \end{array}\right. \end{array} $$
(13)
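The histogram in (13) for one neighbor K and one component ρ can be sketched as follows. The code assumes that the differences \(MV_{0}^{\rho}-MV_{K}^{\rho}\) are given in pixel units and are multiples of the resolution R; differences beyond the bound of one pixel are merged into the two outer bins. The function name and its defaults are illustrative assumptions.

```python
def spatial_histogram(diffs, resolution=0.25, bound=1.0):
    """Normalized histogram of clipped MV component differences.

    diffs: differences MV_0 - MV_K of one component, in pixels.
    Returns 2 * (bound / resolution) + 1 bins (9 bins for the defaults);
    bin index 0 holds everything <= -bound, the last bin everything >= bound.
    """
    t = round(bound / resolution)
    bins = [0] * (2 * t + 1)
    for d in diffs:
        q = round(d / resolution)   # difference in units of the resolution
        q = max(-t, min(t, q))      # merge the tails into the outer bins
        bins[q + t] += 1
    n = len(diffs)
    return [b / n for b in bins] if n else [0.0] * len(bins)
```

Running it for both components and all eight neighbors gives the 8 × 2 × 9 values mentioned above.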

4.3.2 Temporal features extraction

The MVs belonging to all blocks are extracted while decoding the current P-frame. Supposing an exemplary block (\(b_{k}\)) in the current P-frame, the decoded MV belonging to this block (\(MV_{0}(b_{k})\)) determines its corresponding reference block in the reference frame (Fig. 5, right). If any confidential information is embedded in this MV (for instance, one bit of a secret message), the decoded MV might differ from the original MV. The transmitter tries to apply the slightest possible changes to the MVs during embedding in order to leave as few clues as possible. Hence, the confidential information is most likely embedded by replacing the original MV with one of its nearest MVs (Fig. 5, left). Notwithstanding the loss of information during motion compensation, the original MV, which should be one of the closest MVs to the decoded MV, is more likely than the other MVs to be optimal on the receiver’s side. Accordingly, we compute and compare the optimality of these MVs using the feature set introduced in [59]. However, we have modified this feature set to be compatible with various video compression standards, including H.264/AVC. Instead of using MVs with a difference of one unit, we use the MVs shown in Fig. 5, in which h and v are the horizontal and vertical components of the decoded MV, and r is the smallest possible change in an MV during motion estimation (“ME-Resolution”).

Fig. 5 Left: the spatial position and corresponding MVs of blocks in the reference frame used to evaluate the temporal features. Right: illustration of the reference block in the reference frame considering the decoded motion vector and the position of the current block
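The candidate set used for the temporal features can be sketched as below: the decoded MV (h, v) followed by its eight nearest neighbors at spacing r. The ordering of the eight neighbors is an assumption made for illustration, since it depends on the labeling in Fig. 5, and the function name is hypothetical.

```python
def candidate_mvs(h, v, r=0.25):
    """Return the decoded MV (index 0) and its eight nearest neighbors,
    spaced by the motion estimation resolution r in each direction."""
    offsets = [(0, 0)] + [
        (dx, dy)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    ]
    return [(h + dx * r, v + dy * r) for dx, dy in offsets]
```

With r = 1 this reduces to the integer-unit neighborhood used in [59]; with r = 0.25 it matches quarter-pixel motion estimation.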

4.4 Features’ dimensionality reduction

Since high-dimensional features (i) require a very large training set, (ii) increase the probability of classifier overfitting, and (iii) lead to the curse of dimensionality, we suggest a dimensionality reduction stage. We reduce the dimensions of the spatial features from 180-D to 36-D, and the dimensions of the temporal features from 36-D to 18-D, by combining the correlated features.

We compute the horizontal (\(f_{H}^{{\rho }}\)), vertical (\(f_{V}^{{\rho }}\)), right diagonal (\(f_{RD}^{{\rho }}\)), and left diagonal (\(f_{LD}^{{\rho }}\)) spatial features as the average of features corresponding to two neighboring pixels as follows:

$$ \begin{array}{@{}rcl@{}} f_{H}^{{\rho}}=(f^{{\rho}}(4,D)+f^{{\rho}}(5,D))/2 \\ f_{V}^{{\rho}}=(f^{{\rho}}(2,D)+f^{{\rho}}(7,D))/2\\ f_{RD}^{{\rho}}=(f^{{\rho}}(3,D)+f^{{\rho}}(6,D))/2 \\ f_{LD}^{{\rho}}=(f^{{\rho}}(1,D)+f^{{\rho}}(8,D))/2 \end{array} $$
(14)

In order to further reduce the spatial feature’s dimensions, we sum up the features corresponding to the horizontal and vertical components of MVs, and obtain the 36-D spatial features (fs) as (15).

$$ {f}^{s}(k,n) = \left\{ \begin{array}{lcl} \exp({f}_{H}^{v}(n)+{f}_{H}^{h}(n)) & \text{if} & k=1\\ \exp({f}_{V}^{v}(n)+{f}_{V}^{h}(n)) & \text{if} & k=2\\ \exp({f}_{LD}^{v}(n)+{f}_{LD}^{h}(n)) & \text{if} & k=3\\ \exp({f}_{RD}^{v}(n)+{f}_{RD}^{h}(n)) & \text{if} & k=4 \end{array}\right. $$
(15)

To reduce the dimensionality of temporal features, we sum up the SAD and SATD based features to obtain the 18-D temporal features as follows:

$$ f^{t}(n) = \left\{ \begin{aligned} f_{1}^{SAD}(n)+f_{1}^{SATD}(n) &1\leq n\leq 9 \\ f_{2}^{SAD}(n)+f_{2}^{SATD}(n) &10\leq n\leq 18 \end{aligned}\right. $$
(16)
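The reduction steps in (14) to (16) can be sketched as follows. The code assumes that the per-neighbor spatial histograms are stored in two dictionaries (one per MV component) keyed by the neighbor index K = 1, …, 8, each holding nine values. It also assumes that each direction combines the horizontal and vertical components of the same direction, including the left diagonal for k = 3. All names are illustrative.

```python
import math

# Neighbor pairs averaged in (14), in the order k = 1..4 used in (15):
# horizontal, vertical, left diagonal, right diagonal.
PAIRS = [(4, 5), (2, 7), (1, 8), (3, 6)]


def reduce_spatial(f_h, f_v):
    """Reduce the per-neighbor features to 4 rows of 9 values.

    f_h, f_v: dicts mapping K (1..8) to a list of 9 bin values for the
    horizontal and vertical MV components, respectively.
    """
    rows = []
    for a, b in PAIRS:
        avg_h = [(x + y) / 2 for x, y in zip(f_h[a], f_h[b])]
        avg_v = [(x + y) / 2 for x, y in zip(f_v[a], f_v[b])]
        rows.append([math.exp(v + h) for v, h in zip(avg_v, avg_h)])
    return rows


def reduce_temporal(f1_sad, f1_satd, f2_sad, f2_satd):
    """Sum the SAD- and SATD-based features into an 18-D vector, as in (16)."""
    first = [a + b for a, b in zip(f1_sad, f1_satd)]
    second = [a + b for a, b in zip(f2_sad, f2_satd)]
    return first + second
```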

4.5 Spatio-temporal feature concatenation

Finally as shown in Fig. 6, combining 36-D spatial (fs) and 18-D temporal (ft) features, we obtain a 54-dimensional feature set using (17).

$$ F(n) = \left\{ \begin{aligned} f^{s}(\lceil \frac{n}{9}\rceil,n) &1\leq n\leq 36 \\ f^{t}(n-36) & 37\leq n\leq 54 \end{aligned}\right. $$
(17)
Fig. 6 Spatio-temporal features concatenation
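The concatenation itself amounts to flattening the four 9-D spatial rows and appending the 18 temporal values. The sketch below assumes the spatial features are given as four lists of nine values in the order k = 1, …, 4; the function name is illustrative.

```python
def concatenate_features(spatial_rows, temporal):
    """Build the 54-D vector: 4 x 9 spatial values, then 18 temporal values."""
    if len(spatial_rows) != 4 or any(len(row) != 9 for row in spatial_rows):
        raise ValueError("expected 4 spatial rows of 9 values each")
    if len(temporal) != 18:
        raise ValueError("expected 18 temporal values")
    flat_spatial = [value for row in spatial_rows for value in row]
    return flat_spatial + list(temporal)
```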

5 Experiments

5.1 Experimental Settings

5.1.1 Database

Figure 7 shows the first frame of each of the 22 PAL QCIF video sequences (192 × 144 pixels), without prior compression (Footnote 2), used to construct the database. These sequences are downloaded from [54]. The selected sequences cover a wide range of videos in terms of texture, object motion, camera movement, and type of background. Because the sequences contain different numbers of frames, all of them are divided into non-overlapping 60-frame sub-sequences, and at most five 60-frame sub-sequences of each sequence are used in the experiments. In total, 84 video sub-sequences are used for training.

Fig. 7 The first frame of the 22 QCIF video sequences used in the experiments

5.1.2 Video compression method

Because of its wide use and effectiveness, the H.264/AVC baseline profile is employed for video compression. Two different motion estimation algorithms are applied in this test: Exhaustive Search (FULL) and Hexagon-based Search (HEX) [63]. The search range is set to 8 pixels, and the motion estimation resolution is quarter-pixel. Also, three different quantization parameters (QP ∈{17,27,32}) are considered.

5.1.3 Competitor Steganalysis methods

To demonstrate the effectiveness of our proposed method, we compare its results with those of the steganalyzers MVRBF [6], AoSO [48], and NPELO [59], which are the best video steganalysis methods against MV-based video steganography to date. In [6], features are extracted using macro-blocks, whereas in the H.264/AVC algorithm each macro-block may contain several sub-blocks. To adapt this method to the H.264/AVC algorithm, the features are extracted using sub-blocks. Also, these algorithms are not adjusted to video compression standards that support motion estimation with sub-pixel accuracy. As a result, in their original form they are not capable of detecting MV-based steganography methods applied to such compression standards, and we had to refine them to compare their efficiency with our proposed method. Hence, the above schemes are revisited and all features are extracted with sub-pixel accuracy.

5.1.4 Steganography targets

To the best of our knowledge, TAR1 [5] and TAR4 [7] from the second generation, and TAR2 [58] and TAR3 [4] from the third generation, are the best MV-based steganography methods to date. Thus, these schemes are used in the experiments. In [5], the set of Lagrangian multipliers used to evaluate the MVs’ embedding cost function is λ = [0,2,4,6,8], and we set b = −2 and α = 0.5 for the distortion function. Also, we set h = 8 for the syndrome-trellis coder used in all methods (the syndrome-trellis codes are downloaded from [16]).

5.1.5 Embedding

For each steganography method and each sequence, a random message with uniform distribution (Footnote 3) and rates ER ∈ {0.1, 0.2, 0.3} per MV is produced and embedded.

5.1.6 Training and classification

We split all embedded and compressed videos into 12-frame sub-sequences (Footnote 4), and a feature vector is obtained from each of these sub-sequences. MATLAB’s SVM toolbox is used to train each steganalyzer with Gaussian and polynomial kernels, and the best results are listed.

5.1.7 Evaluation criteria for steganalysis performance

Three major metrics are used to evaluate the security level of a steganography scheme against steganalysis attacks: detection accuracy, the receiver operating characteristic (ROC) curve, and the area under the ROC curve (AUC), which is also called detector reliability (Footnote 5). Since detector reliability provides better comparative information than the others, we use this metric in our experiments [2, 11, 14, 29]. Furthermore, to compare the discrimination capability of the steganalyzers, the detector reliability (AUC) is evaluated under different settings.
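As a rough illustration of the metric, the AUC can be computed directly from classifier scores using its rank interpretation: the probability that a randomly chosen stego sample receives a higher score than a randomly chosen cover sample, with ties counted as one half. The function name `auc` and the example scores are assumptions for illustration only; the sketch computes the plain AUC, following the text's usage of the term detector reliability.

```python
def auc(cover_scores, stego_scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Higher scores are assumed to mean "more likely stego".
    """
    wins = 0.0
    for s in stego_scores:
        for c in cover_scores:
            if s > c:
                wins += 1.0
            elif s == c:
                wins += 0.5  # ties contribute half a win
    return wins / (len(stego_scores) * len(cover_scores))

# Hypothetical decision scores, not experimental data.
example_auc = auc([0.1, 0.4, 0.35], [0.8, 0.4, 0.9])
```

A value of 0.5 corresponds to random guessing and a value of 1.0 to perfectly separable classes.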

5.1.8 Performance evaluation setups

Steganalysis algorithms are mostly evaluated under the assumption that some side information is available to the warden. This side information includes the steganography scheme, details of the compression algorithm that are not retrievable on the receiver’s side, the embedding rate, and the original cover. Such conditions can be provided in the laboratory, whereas in the real world this information is inaccessible [27, 49]. To demonstrate the detection capability of the proposed method under various conditions, we apply the steganalysis approaches to the following four setups:

  • Setup 1) Complete laboratory conditions: In this scenario, we suppose that side information about the motion estimation algorithm (e.g., FULL or HEX), the embedding rate per MV, and the steganography scheme is available. Accordingly, each combination of steganography scheme, motion estimation algorithm, and embedding rate is trained and classified separately.

  • Setup 2) Unknown ME algorithm: This scenario assumes that the warden is unaware of the ME algorithm but knows the steganography algorithm and the embedding rate. Therefore, experiments are carried out on a mixture of video sequences compressed and embedded using either the full search (FULL) or the fast search (HEX) algorithm.

  • Setup 3) Unknown ME algorithm and embedding rate: In this scenario, the steganalyzer is assumed to have side information about the steganography algorithm only. Therefore, video sequences are grouped by steganography algorithm, and each group contains sequences with different ME methods and embedding rates.

  • Setup 4) Real-world conditions: This scenario addresses the reliability of the proposed method under realistic (worst-case) conditions. We assume that the warden is completely ignorant of the ME method, the embedding rate, and even the steganography scheme, so the steganalysis tests are performed on a mixture of video sequences with various settings.

    In the last two setups, embedded sequences with various settings are randomly selected and fed into the classifier to obtain the detector reliability (Footnote 6). This procedure is repeated 30 times, and the reported detector reliability is the average over these runs.
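The four setups differ only in which side information is used to partition the data before training. The following sketch makes that explicit. It is a schematic under stated assumptions: the field names (`target`, `me`, `er`), the `KNOWN` table, and the helpers `group_for_setup` and `mean_reliability` are hypothetical, and the quantization parameter is left out for brevity.

```python
from collections import defaultdict

# Side information assumed known to the warden in each setup.
KNOWN = {
    1: ("target", "me", "er"),  # complete laboratory conditions
    2: ("target", "er"),        # unknown ME algorithm
    3: ("target",),             # unknown ME algorithm and embedding rate
    4: (),                      # real-world conditions: nothing known
}

def group_for_setup(sequences, setup):
    """Pool sequences that the warden cannot tell apart in the given setup."""
    groups = defaultdict(list)
    for seq in sequences:
        key = tuple(seq[field] for field in KNOWN[setup])
        groups[key].append(seq)
    return dict(groups)

def mean_reliability(run_once, repetitions=30):
    """Average the detector reliability over repeated random selections.

    `run_once` is a hypothetical callback that performs one random selection,
    trains and tests the classifier, and returns the resulting reliability.
    """
    return sum(run_once(i) for i in range(repetitions)) / repetitions

# One placeholder record per experimental configuration.
sequences = [
    {"target": t, "me": m, "er": e}
    for t in ("TAR1", "TAR2", "TAR3", "TAR4")
    for m in ("FULL", "HEX")
    for e in (0.1, 0.2, 0.3)
]
group_counts = {setup: len(group_for_setup(sequences, setup)) for setup in KNOWN}
```

With these placeholder records, Setup 1 trains a separate classifier for each of the 24 configurations, while Setup 4 pools everything into a single group.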

5.2 Experimental results

  (i) Setup 1: In this setup, the wardens are granted detailed information about the steganography algorithm, quantization parameter, ME algorithm, and embedding rate, and we measure the reliability of the detectors under various settings. To this end, sequences grouped by their properties are subjected to the proposed and rival detectors. The corresponding results are compared in Table 2.

    Table 2 Detector reliability of the proposed blind steganalyzer (MVST), NPELO, AoSO, and MVRB against TAR1-4, using Setup1 with different motion estimation algorithms (ME), embedding rates (ER), and quantization parameters (QP)

    As can be seen, all detectors perform acceptably against TAR1, especially the proposed algorithm and MVRB. The strong results of MVRB stem from a weakness of TAR1: it selects the MVs to be embedded based on SAD, without taking the Lagrangian cost into account. Consequently, the number of MVs that are optimal with respect to SAD increases after embedding. This increase is sharper at higher embedding rates and larger quantization parameters (raising the Lagrangian multiplier decreases the influence of SAD on the Lagrangian cost). Since its features are derived directly from SAD, MVRB is successful in detecting TAR1. The performance of the other two rivals degrades when video compression uses a fast search algorithm, whereas the proposed method maintains a steady level of detection. Indeed, the proposed features yield almost separable classes and near-perfect detection against TAR1.

    In sharp contrast to TAR1, TAR2 shows acceptable resistance against the competitors. The stego sequences produced by this method are, however, readily detected by our proposed features. Most results of AoSO and MVRB are fairly close to random guessing, and using the HEX motion estimation algorithm weakens the competitors.

    Table 2 indicates near-complete resistance of TAR3 and TAR4 against AoSO and MVRB under HEX search, whereas these detectors perform better in the FULL search setting. The reason is that both MVRB and AoSO features are derived from SAD, and more MVs are optimal with respect to SAD after motion estimation with exhaustive search. The figures show that the performance of the competitors degrades when a fast motion estimation algorithm is applied. This degradation is far sharper for NPELO, whose results in the HEX setting are as poor as those of the other rivals. By contrast, the proposed features maintain a high level of detection against TAR3 and TAR4 across the various settings.

    In summary, the results confirm that the proposed features are far more reliable than the rivals’ features. Furthermore, our detector’s reliability remains relatively stable even at low embedding rates.

  (ii) Setup 2: To determine how robust our proposed features are to different ME algorithms, training and classification are carried out on a mixture of sequences compressed with FULL and HEX search. The results are listed in Table 3.

    Table 3 Detector reliability of the proposed blind steganalyzer (MVST), NPELO, AoSO, and MVRB against TAR1-4, using Setup2 with different embedding rates (ER) and quantization parameters (QP)

    The figures show that both the proposed and the rival features can detect TAR1 to a high extent despite the lack of knowledge about the motion estimation algorithm. By contrast, the rivals perform no better than random guessing in detecting TAR2 when the payload rate is low (0.1 and less). These schemes achieve better results in detecting TAR3 and TAR4 under this setup. Additionally, the figures indicate that AoSO is slightly inferior to MVRB.

  (iii) Setup 3: Assuming that the eavesdropper knows the embedding rate seems unrealistic. Hence, the classifier should be trained on a mixture of sequences embedded at different rates.

    As illustrated in Table 4, the lack of information about the embedding rate has not affected the reliability of the detectors. Surprisingly, TAR1 is blindly detectable by the proposed features; in other words, our detector can distinguish clean media from media manipulated using TAR1 to a high extent without requiring partial information. Likewise, it achieves high detection rates against TAR2-4. Compared with our approach, AoSO and MVRB prove too weak to detect TAR2-3, with their reliability hovering around 0.53.

    Table 4 Detector reliability of the proposed blind steganalyzer (MVST), NPELO, AoSO, and MVRB against TAR1-4, using Setup3 with different quantization parameters (QP)

    Overall, the results of our method are noticeably superior to those of the rivals.

  (iv) Setup 4: In the preceding setups, we treated all detectors as targeted steganalysis methods. In this setup, we examine whether the proposed method and its rivals can be regarded as universal steganalysis methods.

    Table 5 suggests that the performance of the rivals degrades considerably when prior knowledge about the embedding algorithm is not available, whereas all of the steganography methods are vulnerable to our proposed features. The figures for all targets imply that these leading methods can be readily detected by the proposed method.

Table 5 Detector reliability of the proposed blind steganalyzer (MVST), NPELO, AoSO, and MVRB against unknown targets, using Setup4 with different quantization parameters (QP)

6 Conclusion

In this paper, using spatio-temporal statistics of motion vectors, we have proposed, implemented, and evaluated a blind steganalysis method for detecting MV-based steganography algorithms. The proposed approach is designed to boost performance in MV-based video steganalysis by addressing the shortcomings of previous approaches. In contrast to previous methods, the proposed method is (i) capable of jointly utilizing the spatio-temporal statistics of the MVs to improve detection accuracy, (ii) capable of capturing subtle statistical clues about MV-based steganography by considering the video codec configuration in the feature extraction stage, (iii) generalized to different video codec configurations, namely variable-block-size and sub-pixel motion estimation, and (iv) less vulnerable to overfitting than some rival methods, owing to the low feature dimension achieved by the dimensionality-reduction stage.

Experimental results have shown that the proposed features outperform the leading prior MV-based steganalysis schemes. What sets our feature extraction method apart from previously proposed ones is its adaptability to various video compression settings and algorithms. On top of that, the proposed features remain relatively stable under different conditions, including different steganography methods, ME algorithms, quantization parameters, and even low embedding rates.

In recent video compression standards, Lagrangian-based cost functions are applied to increase compression efficiency. These functions make their decision based on the number of bits needed to transmit the MV and the SAD between the current block and its reference block. The influence of the MV’s code length on the Lagrangian cost grows with the Lagrangian multiplier. Therefore, a greater Lagrangian multiplier pulls each MV toward its reference MVs and consequently increases the correlation among MVs; it can thus be inferred that smaller quantization parameters yield more resistance against the spatial features. By exploiting joint spatio-temporal features, the proposed method has reached the best detection reliability among the compared methods. In real-world conditions, where no information about the steganography algorithm, motion estimation method, or embedding rate is available, the proposed features have shown very promising detection results (more than 95%, compared with only 65% for the best rival). Our approach has shown a remarkable improvement in blind MV-based steganalysis, so that the prominent MV-based steganography methods are no longer reliable.
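The effect of the Lagrangian multiplier described above can be illustrated with a toy rate-distortion decision of the common form J = SAD + λ·R, where R is the number of bits needed to code the MV. The exact cost function of a given encoder may differ, and the candidate MVs, SAD values, and bit counts below are made-up illustrative numbers, not measurements.

```python
def lagrangian_cost(sad, mv_bits, lam):
    """Rate-distortion cost of a candidate motion vector: J = SAD + lam * R."""
    return sad + lam * mv_bits

def best_candidate(candidates, lam):
    """Return the candidate with the lowest Lagrangian cost."""
    return min(candidates, key=lambda c: lagrangian_cost(c["sad"], c["bits"], lam))

# Two hypothetical candidates: one with the lowest SAD but an expensive MV,
# and one close to the predicted MV (cheap to code) with a slightly higher SAD.
candidates = [
    {"mv": (5, -3), "sad": 100, "bits": 12},
    {"mv": (1, 0), "sad": 110, "bits": 4},
]

# Multiplier values taken from the set listed in Section 5.1.4.
choice_by_lambda = {
    lam: best_candidate(candidates, lam)["mv"] for lam in (0, 2, 4, 6, 8)
}
```

With λ = 0 the SAD-optimal vector is chosen, whereas any larger multiplier in this example selects the vector that is cheaper to code, which is the mechanism that increases correlation among neighboring MVs.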

For future work, we aim to consider scenarios in which the confidential information is embedded only in the MVs belonging to deformable objects, using object detection and tracking approaches [8,9,10]. Tracking these objects is particularly important since embedding confidential information in their corresponding MVs leaves less evidence of steganography.