Blind MV-based video steganalysis based on joint inter-frame and intra-frame statistics

Despite its many legitimate benefits, the development of steganography methods has sparked ever-increasing concern over steganography abuse in recent decades. To prevent the malicious use of steganography, steganalysis approaches have been introduced. Since motion vector (MV) manipulation leads to random and indirect changes in the statistics of videos, MV-based video steganography has been the center of attention in recent years. In this paper, we propose a 54-dimensional feature set exploiting spatio-temporal features of motion vectors to blindly detect MV-based stego videos. The proposed features originate from two facts. First, there are strong dependencies among neighboring MVs, both because rate-distortion optimization encourages them to be similar and because they often belong to the same rigid object or static background. Accordingly, MV manipulation can leave important clues in the differences between each MV and the MVs of the neighboring blocks. Second, notwithstanding the information loss during compression, a majority of MVs in original videos remain locally optimal with respect to the Lagrangian multiplier after decoding. Motion vector alteration during embedding disturbs these statistics, which can be utilized for steganalysis. Experimental results show that our features far exceed state-of-the-art steganalysis methods in performance. This outstanding performance stems from the use of complementary spatio-temporal statistics affected by MV manipulation, as well as a feature dimensionality reduction applied to prevent overfitting. Moreover, unlike existing MV-based steganalysis methods, our proposed features can be adjusted to various settings of state-of-the-art video codec standards, such as sub-pixel and variable-block-size motion estimation.


Introduction
The development of wireless communications has brought countless advantages to our daily lives, albeit sometimes with drawbacks. The issue at the heart of modern communications is the absence of high-level security. Hence, cryptography schemes have been applied to secure information from unauthorized access or modification, although cryptography cannot fulfill all security expectations. Transmitting meaningless content through a communication channel leaves a clue that secret communication is taking place, whereas we sometimes aim to hide the very existence of the confidential information. In these cases, steganography approaches are employed to cover the communication with innocent-looking media. On the other hand, steganalysis approaches have been developed to detect the existence of confidential information in a suspicious medium. These algorithms receive the suspicious medium as input and classify it as either clean media (without a secret message) or stego media (containing confidential information). The steganalyzer is sometimes assumed to know the exact steganography algorithm that might have been applied, to have partial information about it, or to be completely ignorant of it. Based on this available information, steganalysis approaches fall into two main categories: specific (targeted) and blind (universal) steganalysis. Specific steganalysis approaches are designed to detect a particular steganography method, while blind steganalysis approaches are developed to detect a group of steganography algorithms without any detailed knowledge of the embedding strategy. Besides, quantitative steganalysis refers to the eavesdropper's effort to estimate the embedding rate or, equivalently, the length of the confidential message embedded in a host medium [25,35]. For that to be possible, full information about the steganography algorithm is required.
Since obtaining detailed information about the steganography scheme is usually unrealistic, blind steganalysis approaches are of great importance.
Among all steganography hosts, including image, audio, video, and network protocol packets, video is regarded as a particularly suitable host for embedding high-volume secret messages because of its high capacity and information redundancy. Nowadays, in order to decrease transmission cost and required storage space, all types of digital media are compressed. Lossy compression algorithms introduce some nondeterministic components into the output media, and these components are ideal covers for confidential information; thus, steganography is often performed during compression. By using more appropriate motion vectors for manipulation, better alteration methods, and more suitable video compression standards, MV-based steganography approaches have seemingly come to dominate MV steganalysis methods.
Video coding entails particular statistics in the motion vectors: a) The MVs belonging to the neighboring blocks in a coded frame are highly correlated. This correlation exists because (i) neighboring blocks are likely to belong to the same rigid object or static background (for more explanation, please refer to Fig. 2 in [20]), and (ii) the cost function employed during motion estimation encourages the MVs to be close together. b) The lossy coding stage during motion compensation can modify the statistics of video and shift the MVs from locally optimal to non-optimal. A majority of MVs, nevertheless, remain locally optimal from the receiver's point of view.
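These two regularities can be made concrete with a small sketch (the function names and toy costs below are illustrative, not from the paper): an MV is called locally optimal when none of its eight nearest candidate positions achieves a smaller Lagrangian cost J(mv) = SAD(mv) + λ·R(mv).

```python
def lagrangian_cost(sad, rate_bits, lam):
    # J(mv) = SAD(mv) + lambda * R(mv): distortion plus weighted bit cost
    return sad + lam * rate_bits

def is_locally_optimal(center_cost, neighbour_costs):
    # The decoded MV is locally optimal if no neighbouring candidate
    # position achieves a strictly smaller Lagrangian cost.
    return all(center_cost <= c for c in neighbour_costs)
```

Embedding perturbs the decoded MV, so the perturbed position tends to fail this check more often than an untouched one, which is the statistical signal exploited later in the paper.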
Exploiting the aforementioned facts, we propose a spatio-temporal steganalysis feature extraction method that takes full advantage of the clues MV-based video steganography leaves in the statistics of a video. The rest of this paper is organized as follows. In Section 2, we position our work in the literature by reviewing related work on video steganography and steganalysis. Section 3 reviews the basic concepts of motion estimation and details motion estimation and compensation during video encoding. The proposed method is described in Section 4 and includes the following contributions: 1) We propose a novel MV-based steganalysis approach that uses complementary spatio-temporal features to capture the clues MV-based steganography leaves in video statistics. 2) Comparisons using four targeted MV-based steganography approaches reveal that the proposed approach outperforms its four MV-based steganalysis rivals; indeed, it dramatically improves the reliability of MV-based steganalysis. 3) We evaluate the effect of different compression settings, namely the motion estimation algorithm and the quantization parameter, on detection reliability. The experimental results confirm the stability of our features under various settings; moreover, experiments show that their performance remains robust even at very low embedding rates. 4) The proposed approach can, to a great extent, detect state-of-the-art MV-based steganography methods blindly, i.e., without requiring any side information about the steganography approach, embedding rate, or motion estimation algorithm; it exploits the joint spatial-temporal features of MVs to reach better performance. 5) Unlike the competing steganalysis approaches, our features can be extracted even in the case of variable-block-size motion estimation.
Besides, the proposed feature extraction approach is compatible with all existing video compression standards.
In Section 5, the experimental settings are explained and the experimental results are presented to confirm the superiority of the proposed method in both laboratory and real-world conditions. Since H.264/AVC remains one of the most efficient compression algorithms in terms of compression efficiency, coding speed, and prevalence, the H.264/AVC baseline compression standard is applied in the experiments without loss of generality. Finally, the conclusion is presented in Section 6.
Related work

A steganography algorithm is reliable as long as it remains undetectable against all existing steganalysis attacks. Accordingly, security is the main criterion of steganography.
Since steganalysis attacks are accessible to the transmitter, steganography algorithms can be tested against them to determine whether they are trustworthy for embedding. Several metrics demonstrate to what extent a steganography algorithm is secure, such as detection accuracy, the ROC (Receiver Operating Characteristic) curve, and the AUC (Area Under the Curve).
Motion vectors seem to be the best element of video coding for hiding information, for several reasons. First, MV-based video steganography leads to indirect and complicated changes in the inter-frame and intra-frame statistics of the video. Moreover, experiments have shown that neighboring MVs exhibit much lower similarity than neighboring pixels (see Fig. 1 and Fig. 2 in [43]). As a result, the detection complexity of MV-based steganography is higher than that of other methods, and MV alteration is the most robust strategy against steganalysis attacks [2]. Besides, owing to the motion compensation step, MV manipulation does not cause perceptible degradation in the visual quality of the output video. Furthermore, video steganalysis methods that model the embedding procedure as additive noise cannot detect the presence of a message in MVs [7]. Hence, MV alteration has become the preferred video steganography strategy.
MV-based video steganography methods deal with two fundamental problems: (i) choosing MVs that are as undetectable as possible after modification, and (ii) designing a modification algorithm that causes the fewest changes in the statistics of the output video. Accordingly, MV-based steganography approaches have gone through three phases of progression [58]. In the first stage, MVs with large prediction error or magnitude were supposed to be the best cloak for confidential information, and the message was embedded in the magnitude or the phase of the MVs [1,13,55,61]. Because of selecting unsuitable MVs for modification and applying improper embedding algorithms, these methods failed to preserve the statistical characteristics of the original video; in other words, the information they embed is easily detectable even by early generations of MV-based steganalysis approaches [6,42].
Obviously, making more modifications with a given embedding algorithm raises the detection probability. Hence, in the second stage, Syndrome-Trellis Codes (STC) [4,5,15], Wet Paper Codes (WPC) [4,7,18,19], and BCH codes [32] were introduced and applied to improve the embedding efficiency (the number of embedded bits per modification [11]); this results not only in higher security but also in improved imperceptibility. Based on the observation that MV manipulation shifts MVs from locally optimal to non-optimal, the "Reversion-Based features" [6] and "AoSO features" [48] were introduced. Besides, the authors of [51] proposed a high-dimensional feature set considering the correlations between each macro-block and its neighbors. In order to provide a higher level of security against steganalysis attacks, Cao et al. [5] proposed selecting the most uncertain MVs for embedding. To provide robust steganalysis features against the second-generation steganography methods, Yao et al. [57] suggested a cost function based on the relationships between the MV of each macro-block and its neighbors.
Due to information loss in the motion compensation phase, some altered MVs are locally optimal after reconstruction at the receiver's side. The methods which take advantage of this fact have formed the third stage of development in MV-based steganography [4,20,58].
To detect more subtle changes in the statistics of MVs after embedding, Zhang et al. [59] proposed the "Near-Perfect Estimation for Local Optimality features," which exploit the local optimality of motion vectors with respect to the Lagrangian multiplier applied during compression.
Considering all MV-based steganalysis methods, [6,48,59] have proved to provide the strongest MV-based steganalysis features ([39] has recently suggested an entropy-based feature set whose results are fairly similar to those of [59]). However, even these approaches cannot detect the currently best steganography methods (e.g., [4,7,20,58]). In the following, the two state-of-the-art methods that inspired the proposed method are described in detail.

Near-perfect steganalytic features
Based on the assumption that an overwhelming majority of motion vectors are locally optimal w.r.t. the Lagrangian multiplier from the receiver's perspective, Zhang et al. [59] have proposed a 36-D steganalysis feature set. This feature set, called NP estimation features, is extracted using each decompressed MV, its eight neighbors, and their corresponding SAD (Sum of Absolute Differences) based and SATD (Sum of Absolute Transformed Differences) based Lagrangian costs. NP estimation features consist of four types of features, each containing nine dimensions, as follows: • Feature Set 1: The jth feature of type 1 is defined as the probability that the SAD-based Lagrangian cost of the jth MV position is the minimum.
In (1), N is the number of blocks in a GOP (Group Of Pictures) including M P-frames.
Besides, δ(a, b) (for arbitrary values a and b) equals 1 if a = b, and 0 otherwise. As shown in Fig. 5, mv_0(b_k) refers to the decoded motion vector of block b_k, and mv_{1-8}(b_k) are the eight closest neighboring MVs to the decoded MV. In this method, the closest distance (r in Fig. 5) is set to one.
• Feature Set 2: The jth feature of type 2 is defined as the exponentially magnified difference between the SAD-based Lagrangian cost of the jth position and the minimum Lagrangian cost.
Feature sets 3 (f_1^SATD) and 4 (f_2^SATD) are similar to feature sets 1 and 2, respectively, the only difference being that they are obtained by applying SATD instead of SAD.
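A minimal sketch of how feature sets 1 and 2 could be accumulated over a GOP follows; the exact exponential weighting used in [59] may differ, so this is one plausible reading, and the function name is ours.

```python
import math

def np_features(costs_per_block):
    """costs_per_block: list of 9-element lists; entry j holds the SAD-based
    Lagrangian cost of candidate MV position j for one block (position 0 is
    the decoded MV itself). Returns (f1, f2): f1[j] is the empirical
    probability that position j attains the minimum cost; f2[j] magnifies
    the gap between position j's cost and the minimum exponentially."""
    n = len(costs_per_block)
    f1 = [0.0] * 9
    f2 = [0.0] * 9
    for costs in costs_per_block:
        c_min = min(costs)
        f1[costs.index(c_min)] += 1.0 / n       # indicator of the minimum
        for j, c in enumerate(costs):
            f2[j] += math.exp(-(c - c_min)) / n  # gap to the minimum, magnified
    return f1, f2
```

With a single block whose decoded MV already has the lowest cost, f1 concentrates all mass on position 0, which is exactly the "mostly locally optimal" signature of a clean video.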
There are three major drawbacks to this approach. First and foremost, as indicated in Table 1 (which reports results for different quantization parameters (QP) and two motion estimation (ME) methods, hexagon-based search (HEX) and full search (FULL)), local optimality of MVs with respect to the Lagrangian multiplier on the transmitter's side does not guarantee that they are locally optimal from the receiver's point of view, although in most video frames the locally optimal MVs on the receiver's side still outnumber the non-locally-optimal ones. Secondly, in this approach the MV's nearest neighbors differ by one unit from the original MV, whereas the nearest MVs should be adapted to the motion estimation resolution. Finally, the correlations of MVs within each frame are not considered in this feature set, even though nearby MVs are significantly correlated.

Improved steganalysis features
As depicted in Fig. 1 (which shows the various distributions of one central macro-block and its two neighbors [51]; MV_C is the current block, and MV_1 and MV_2 are its two neighbors), in [51] a feature set is defined by taking into account 20 possible combinations of the MVs of each macro-block and its two neighbors. For each distribution, 9 × 9 = 81 features are introduced as follows. First, the difference between the central MV and each of its two neighbors is calculated. The difference value can be any member of the set {−4, −3, −2, −1, 0, 1, 2, 3, 4}: any value larger than 4 or smaller than −4 is clipped to 4 or −4, respectively. Next, an 81-dimensional feature set is formed using the joint differences of each MV and its two neighbors. Finally, 81 × 20 features are computed by combining the 20 possible distributions. To also consider temporal correlations of MVs, two reference frames can be used to add 4 × 20 × 81 more features. The major drawback of this method is its relatively high dimensionality, which may lead to the curse of dimensionality. There are two further weaknesses. First, it does not support video compression standards with sub-pixel accuracy, so it needs refinement to match common compression standards. Second, the features are extracted under the assumption that MVs are computed on fixed-size macro-blocks, while in recent standards MVs are computed on variable-sized blocks to obtain a better compression ratio. Therefore, this algorithm is not implementable in standards with recent motion estimation algorithms.
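The core of each 81-D distribution is a joint histogram of truncated MV-component differences; the sketch below builds one such histogram (the function name and the triple layout are ours, chosen only for illustration).

```python
def joint_diff_histogram(mv_triples, T=4):
    """mv_triples: list of (mv_c, mv_1, mv_2) integer MV components, where
    mv_c belongs to the central block and mv_1, mv_2 to its two neighbours.
    Builds the (2T+1) x (2T+1) joint histogram of the differences
    (mv_c - mv_1, mv_c - mv_2), each clipped to [-T, T], as in [51]."""
    hist = [[0] * (2 * T + 1) for _ in range(2 * T + 1)]
    clip = lambda d: max(-T, min(T, d))
    for mv_c, mv_1, mv_2 in mv_triples:
        d1, d2 = clip(mv_c - mv_1), clip(mv_c - mv_2)
        hist[d1 + T][d2 + T] += 1  # shift so bin index 0 maps to -T
    return hist
```

With T = 4 this yields the 9 × 9 = 81 bins per distribution described above; repeating it over the 20 distributions gives the full 81 × 20 feature vector.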

General theory: motion estimation and compensation
Compression algorithms have been developed to achieve faster and cheaper transmission as well as to reduce the required storage space. These days, H.264/AVC is still one of the most common compression standards. In this algorithm, each P-frame is compressed using one reference frame. The P-frame is partitioned into non-overlapping macro-blocks of 16 × 16 pixels. There are four decision modes for each macro-block: full (16 × 16), vertical (the macro-block is divided into two 8 × 16 partitions), horizontal (two 16 × 8 partitions), and quadruple (four 8 × 8 partitions). For each decision mode, the optimal MV per partition is obtained based on a cost function (8) using a predetermined Lagrangian multiplier (5). For an exemplary partitioned block (b_k), the corresponding optimal MV (MV_{b_k}) is obtained using (9).
In (8), R_{b_k,mv} is the number of bits required to transmit the candidate MV, and QP in (6) is the quantization parameter. Besides, SAD_{b_k,mv} in (7) is the sum of absolute differences between the pixels of the current block in the original P-frame (F_Org) and the corresponding block indicated by mv in the reconstructed reference frame (F_Rec). Afterward, the optimal partitioning mode (Mode_Opt) is chosen using (10). If the chosen mode is quadruple, the mode decision algorithm is applied again to each 8 × 8 partition (full (8 × 8), vertical (4 × 8), horizontal (8 × 4), and quadruple (4 × 4)) to select the mode with minimum cost (Fig. 2).
In (10), cfr_mode is the final output bitrate of the macro-block for the candidate partitioning mode, and SSD_{b_k,MV,mode} is the sum of squared differences between the original and the reconstructed (16 × 16) or (8 × 8) block [37].
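The mode decision in (10) amounts to minimizing a rate-distortion cost over the candidate partitionings; a minimal sketch follows (the dictionary keys and toy numbers are hypothetical, not from the standard).

```python
def best_mode(candidates, lam_mode):
    """candidates maps a partitioning mode name to (SSD, rate_bits).
    Returns the mode minimising J = SSD + lambda_mode * R, mirroring the
    role of Eq. (10) in the mode decision."""
    return min(candidates,
               key=lambda m: candidates[m][0] + lam_mode * candidates[m][1])
```

A large λ penalizes bit-hungry modes (favoring coarse partitions), while a small λ favors low-distortion fine partitions, which is why the chosen QP indirectly shapes the MV statistics exploited later.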

Proposed steganalysis method
Overview

Figure 3 illustrates the block diagram of the proposed steganalysis feature extraction method, based on the MVs' spatio-temporal features and termed MVST.
The proposed features are designed to address the shortcomings of the predecessors discussed in Section 2. We extract a 54-D steganalysis feature vector for every GOP of M consecutive P-frames containing N motion vectors (variable partition sizes, as allowed by the H.264/AVC baseline profile, are considered in this study). The steganalysis feature vector consists of a 36-D spatial feature set and an 18-D temporal feature set, which are extracted by the following scheme and finally concatenated. The spatial features of each motion vector are updated based on its differences with the MVs of the eight neighboring partitions of its block. The temporal features are updated based on the local optimality conditions of each MV, taking its reconstructed reference frame into account as described in [59].

Frame decoding
As the first step, the current frame and its reference frame are decoded and reconstructed using the input bitstream.

Partitioning mode extraction
For each motion vector, the partitioning mode is obtained during decoding. The bitstream of the partitioning mode is the first part of each MV's bitstream and can be extracted by Exp-Golomb decoding (more details on the H.264/AVC bitstream can be found in [40]). If the extracted number (Mode_1) is smaller than three, we set s = 16 and Mode = Mode_1. If the extracted number equals three, we set s = 8 and apply Exp-Golomb decoding to the remaining bitstream to find Mode = Mode_2. Subsequently, the partitioning mode is obtained from (11), in which BS_x and BS_y are the width and height of the current block, respectively (Fig. 2 illustrates the partitioning-mode evaluation during P-frame encoding).
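The ue(v) Exp-Golomb code used for H.264/AVC syntax elements such as the partitioning mode can be decoded as follows; this is a self-contained sketch over a bit string, whereas a real decoder operates on the entropy-coded bitstream.

```python
def exp_golomb_decode(bits):
    """Decodes one unsigned Exp-Golomb codeword ue(v) from a '0'/'1'
    string; returns (value, remaining_bits). A codeword consists of k
    leading zeros, a '1', and then k info bits: value = 2**k - 1 + info."""
    zeros = 0
    while bits[zeros] == '0':
        zeros += 1
    info = bits[zeros + 1: 2 * zeros + 1]          # the k info bits
    value = (1 << zeros) - 1 + (int(info, 2) if info else 0)
    return value, bits[2 * zeros + 1:]              # leftover bits
```

For example, the codeword "1" decodes to 0 and "00101" decodes to 4, matching the value-then-continue pattern used to read Mode_1 and, when needed, Mode_2 from the remaining bits.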

Spatial features extraction
Extracting the MVs of neighboring pixels

After decoding a P-frame completely, the MV of each pixel in the frame is determined. Since variable-sized blocks are allowed in recent video compression standards, the corresponding pixels of neighboring sub-blocks may have different MV values (Fig. 4, left). Hence, in the proposed method, one position in each neighboring block is selected as the reference pixel. As illustrated in Fig. 4, supposing the position of the pixel in the top-left corner of the central block is (i_0, j_0), the neighboring MVs are evaluated as follows (Fig. 4, right):

Calculating the spatial features

The rough idea behind this feature set is inspired by [51].
The difference between each of the eight aforementioned MVs and the MV of the central block (MV_0) is calculated. Afterwards, as indicated in (13), the features related to the horizontal differences between MVs (f_h(K, D)) and the vertical differences between MVs (f_v(K, D)) are computed (ρ ∈ {h, v}). In these equations, K ∈ [1, 8] and T is the truncation threshold. The features are designed to capture the correlations between motion vectors corresponding to a static background or the same rigid object. The greater the difference between the MV of the current block and that of a neighboring pixel, the less probable it is that the two MVs are correlated. Accordingly, we consider an absolute difference of one (T = 1) as the upper bound (least correlation value); therefore, each difference value greater than 1 or smaller than −1 is clipped to 1 or −1, respectively. Also, Scale = 1/R (R is the motion vector resolution, equal to 0.25 in the experiments), and δ{a, b} = 1 if a = b and 0 otherwise. Finally, we obtain a 9-D feature set per horizontal or vertical component of each neighbor; combining the features of all eight neighbors yields a vector of 8 × 2 × 9 features.
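Under our reading of (13), the 9 bins per neighbor correspond to the quarter-pel difference values −1, −0.75, …, 0.75, 1, mapped to integer bins via Scale = 1/R; the sketch below handles one MV component (the function name and argument layout are ours).

```python
def spatial_feature_hist(center, neighbours, R=0.25, T=1):
    """center: one MV component of the current block; neighbours: the same
    component of the eight reference-pixel MVs. Returns one 9-bin histogram
    per neighbour: each difference is clipped to [-T, T] and scaled by 1/R
    so that quarter-pel steps fall on the 9 integer bins -4..4."""
    scale = 1.0 / R
    hists = []
    for nb in neighbours:
        hist = [0] * 9
        d = max(-T, min(T, center - nb))           # truncate to [-T, T]
        hist[int(round(d * scale)) + 4] += 1        # shift -4..4 to 0..8
        hists.append(hist)
    return hists
```

Running this for both the horizontal and vertical components of all eight neighbors accumulates the 8 × 2 × 9 spatial features described above.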

Temporal features extraction
The MVs of all blocks are extracted while decoding the current P-frame. For an exemplary block (b_k) in the current P-frame, the decoded MV belonging to this block (MV_0(b_k)) determines its corresponding reference block in the reference frame (Fig. 5, right). If any confidential information is embedded in this MV (for instance, one bit of a secret message), the decoded MV may differ from the original MV. The transmitter tries to apply the slightest possible changes to the MVs during embedding so as to leave as few clues as possible. Hence, the confidential information should have been embedded by replacing the original MV with one of its nearest MVs (Fig. 5, left). Notwithstanding the loss of information during motion compensation, the original MV, which should be one of the closest MVs to the decoded MV, is more likely than the other MVs to be optimal on the receiver's side. Accordingly, we compute and compare the optimality of these MVs using the feature set introduced in [59]; however, we have modified this feature set to be compatible with various video compression standards, including those supporting sub-pixel and variable-block-size motion estimation.

Features' dimensionality reduction
Given that high-dimensional features lead to (i) the need for a very large training set, (ii) an increased probability of classifier overfitting, and (iii) the curse of dimensionality, we introduce a dimensionality reduction stage. We reduce the dimensions of the spatial features from 180-D to 36-D: we sum up the features corresponding to the horizontal and vertical components of the MVs and obtain the 36-D spatial features (f_s) as in (15).
To reduce the dimensionality of the temporal features, we sum up the SAD-based and SATD-based features to obtain the 18-D temporal features as follows:
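Both reductions amount to element-wise sums of matching feature halves; a minimal sketch, assuming the horizontal/vertical and SAD/SATD halves share the same layout (the function name is ours):

```python
def reduce_features(f_h, f_v, f_sad, f_satd):
    """Element-wise sums: horizontal + vertical spatial features (as in
    Eq. (15)) and SAD-based + SATD-based temporal features (as in Eq. (16))."""
    f_s = [a + b for a, b in zip(f_h, f_v)]      # reduced spatial features
    f_t = [a + b for a, b in zip(f_sad, f_satd)]  # reduced temporal features
    return f_s, f_t
```

Summing rather than concatenating halves the dimensionality at the cost of merging the two components' statistics, which the paper accepts in exchange for a lower overfitting risk.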

Spatio-temporal feature concatenation
Finally, as shown in Fig. 6, by combining the 36-D spatial (f_s) and 18-D temporal (f_t) features, we obtain a 54-dimensional feature set using (17).

Video database

Figure 7 shows the first frame of the 22 PAL QCIF video sequences (176 × 144 pixels), without prior compression, used to construct the database. These sequences are downloaded from [54]. The selected sequences cover a wide range of videos in terms of texture, object motion, camera movement, and background type. Because the sequences contain different numbers of frames, each is divided into non-overlapping 60-frame sub-sequences, and at most five 60-frame sub-sequences per sequence are used for the experiments. In total, 84 video sub-sequences are used for training.

Video compression method
Because of its wide use and effectiveness, the H.264/AVC baseline profile is employed for video compression. Two different motion estimation algorithms are applied in the experiments: exhaustive search (FULL) and hexagon-based search (HEX) [63]. The search range is set to 8 pixels, and the motion estimation resolution is quarter-pixel. Also, three different quantization parameters (QP ∈ {17, 27, 32}) are considered.
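For reproduction, settings of this kind can be approximated with the x264 encoder's command line; the flags below are from x264 and are given only as an illustration, since the paper does not state which encoder implementation was used.

```shell
# Baseline profile, fixed QP (one of 17, 27, 32), hexagon search
# (use --me esa for exhaustive search), 8-pixel search range,
# and sub-pixel (quarter-pel) motion-vector refinement.
x264 --profile baseline --qp 27 --me hex --merange 8 --subme 5 \
     --input-res 176x144 -o compressed.264 input.yuv
```

Varying `--qp` and `--me` over the values above reproduces the six compression settings used in the experiments.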

Competitor Steganalysis methods
To demonstrate the effectiveness of our proposed method, we compare its results with those of the steganalyzers MVRB [6], AoSO [48], and NPELO [59], which are the best video steganalysis methods against MV-based video steganography to date. Note that in [6] the features are extracted using macro-blocks, whereas in H.264/AVC each macro-block may contain sub-blocks; to adapt this method to H.264/AVC, the features are extracted using sub-blocks. Also, these algorithms are not adjusted to video compression standards that support sub-pixel motion estimation, so they cannot detect MV-based steganography applied to such standards; we therefore had to refine them to compare their efficiency.

Steganography targets
To the best of our knowledge, TAR1 [5] and TAR4 [7] from the second generation, and TAR2 [58] and TAR3 [4] from the third generation, are the best MV-based steganography methods to date; thus, these schemes are used in the experiments. For [5], the set of Lagrangian multipliers used to evaluate the MVs' embedding cost function is λ = [0, 2, 4, 6, 8], and we set b = −2 and α = 0.5 for the distortion function. Also, we set h = 8 for the syndrome-trellis coder used in all methods (syndrome-trellis codes are downloaded from [16]).

Embedding
For each steganography method and each sequence, a random message with uniform distribution is produced and embedded at rates ER ∈ {0.1, 0.2, 0.3} bits per MV.

Training and classification
We split all embedded and compressed videos into 12-frame sub-sequences, and a feature vector is obtained from each sub-sequence. MATLAB's SVM toolbox is used to train each steganalyzer with Gaussian and polynomial kernels, and the best figures are reported.
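The training step can be mirrored in scikit-learn (substituting for MATLAB's SVM toolbox); the synthetic 54-D vectors below merely stand in for real cover/stego features.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy stand-ins for 54-D cover (label 0) and stego (label 1) feature vectors.
X = np.vstack([rng.normal(0.0, 1.0, (40, 54)),
               rng.normal(0.8, 1.0, (40, 54))])
y = np.array([0] * 40 + [1] * 40)

# Train with Gaussian (RBF) and polynomial kernels and keep the best,
# as done in the paper's classification step.
scores = {k: cross_val_score(SVC(kernel=k, gamma="scale"), X, y, cv=5).mean()
          for k in ("rbf", "poly")}
best_kernel = max(scores, key=scores.get)
```

Cross-validated accuracy per kernel makes the "best figures are reported" step explicit and reproducible.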

Evaluation criteria for steganalysis performance
Three major metrics evaluate the security level of a steganography scheme against steganalysis attacks: detection accuracy, the Receiver Operating Characteristic (ROC) curve, and the Area Under the Curve (AUC), also called detector reliability. Since detector reliability provides better comparative information than the others, we use this metric in the experiments [2,11,14,29]. Furthermore, to compare the discrimination capability of each steganalyzer, the detector reliability (AUC) is evaluated under different settings.
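Detector reliability (AUC) can be computed directly from detector scores via the Mann-Whitney rank identity, without plotting the ROC curve; this is a generic sketch not tied to any specific classifier.

```python
def auc(scores_cover, scores_stego):
    """AUC as the probability that a randomly drawn stego sample scores
    higher than a randomly drawn cover sample, counting ties as 1/2
    (the Mann-Whitney identity for the area under the ROC curve)."""
    pairs = [(s > c) + 0.5 * (s == c)
             for s in scores_stego for c in scores_cover]
    return sum(pairs) / len(pairs)
```

An AUC of 1.0 means perfect separation, while 0.5 corresponds to the random-guessing baseline against which the rival detectors are judged below.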

Performance evaluation setups
Steganalysis algorithms are mostly evaluated assuming some side information is available on the warden's side. Such information includes the steganography scheme, details about the compression algorithm that are not retrievable on the receiver's side, the embedding rate, and the original cover. These conditions can be provided in the laboratory, while in the real world such information is inaccessible [27,49]. To demonstrate the effectiveness and detection capability of the proposed method under various conditions, we apply the steganalysis approaches in the four following setups:

Setup 1) Complete laboratory conditions: In this scenario, we suppose that side information about the type of motion estimation algorithm (e.g., FULL or HEX), the embedding rate per MV, and the steganography scheme is available. Accordingly, different steganography schemes with different motion estimation algorithms and embedding rates are trained and classified separately.

Setup 2) Unknown ME algorithm: This scenario assumes that the warden is unaware of the ME algorithm but has knowledge of the steganography algorithm and the embedding rate. Therefore, experiments are carried out on a combination of video sequences compressed and embedded using the full search (FULL) or fast search (HEX) algorithm.

Setup 3) Unknown ME algorithm and embedding rate: In this scenario, the steganalyzer is assumed to have side information only about the steganography algorithm. Therefore, video sequences are grouped by steganography algorithm, and each group contains sequences with different ME methods and embedding rates.

Setup 4) Real-world conditions: This scenario addresses the reliability of the proposed method under realistic (worst-case) conditions. We assume that the warden is completely ignorant of the ME method, the embedding rate, and even the steganography scheme, so steganalysis tests are performed on a mixture of video sequences with various settings.

In the last two setups, embedded sequences with various settings are randomly selected and fed into the classifier to obtain the detector reliability. This stage is repeated 30 times, and the reported detector reliability is the average value.

Experimental results
(i) Setup 1: In this setup, granting the wardens detailed information about the steganography algorithm, quantization parameter, ME algorithm, and embedding rate, we aim to measure the reliability of the detectors under various settings. (It should be noted that the numbers of cover and stego samples fed to the classifier must be balanced; otherwise, the classification results would be unsatisfactory, since machine learning algorithms cannot produce precise classifiers from an imbalanced dataset [38].) Sequences grouped by their properties are subjected to the proposed and rival detectors, and the corresponding results are compared in Table 2 (detector reliability of the proposed blind steganalyzer MVST, NPELO, AoSO, and MVRB against TAR1-4, using Setup 1 with different motion estimation algorithms (ME), embedding rates (ER), and quantization parameters (QP); the best steganalysis result for each steganography setting is shown in bold). As can be seen, all detectors show acceptable performance against TAR1, especially the proposed algorithm and MVRB. The considerable results of MVRB originate from a weakness of TAR1: it selects the embedded MVs based on SAD without taking the Lagrangian cost into account. Consequently, the number of MVs that are optimal with respect to SAD increases after embedding, and this increase is sharper at higher embedding rates and larger quantization parameters (raising the Lagrangian multiplier decreases the influence of SAD on the Lagrangian cost). Since its features are extracted exactly on the basis of SAD, MVRB is successful in tracing TAR1. The performance of the two other rivals is negatively influenced when video compression is conducted with a fast search algorithm, while the proposed method shows a steady level of detection. Indeed, the proposed features achieve almost completely separable classes and near-perfect detection against TAR1.
In sharp contrast to TAR1, TAR2 demonstrates acceptable resistance against the competitors. The stego sequences produced by this method are, however, readily detected by our proposed features. The majority of the AoSO and MVRB results are fairly close to random guessing. Using the HEX motion estimation algorithm further weakens the competitors: Table 2 indicates near-complete resistance of TAR3 and TAR4 against AoSO and MVRB under the HEX search condition. Indeed, these detectors perform better in the FULL search setting. The reason is that both the MVRB and AoSO features are exploited based on SAD, and more MVs are optimal with respect to SAD after motion estimation with an exhaustive search. It can be perceived from the figures that the performance of the competitors degrades when a fast motion estimation algorithm is applied. This degradation is far sharper for NPELO, whose results in the HEX setting are as poor as those of the other rivals. By contrast, the proposed features maintain a high level of detection against TAR3 and TAR4 in various settings.
In summary, the results confirm that the proposed features are far more reliable than the rivals' features. Furthermore, our detector's reliability remains relatively stable even at low embedding rates.
(ii) Setup 2: In order to specify how robust our proposed features are against different ME algorithms, training and classification are carried out on a combination of sequences compressed with FULL and HEX search. The results are listed in Table 3. The figures show that both the proposed and rival features detect TAR1 to a high extent, notwithstanding the lack of knowledge about the motion estimation algorithm. Conversely, the rivals perform no better than random guessing in detecting TAR2 when the payload rate is low (0.1 and less). These schemes demonstrate better results in tracing TAR3 and TAR4 under the aforesaid setup. Additionally, the figures signify that AoSO is slightly inferior to MVRB.
(iii) Setup 3: Having knowledge about the embedding rate on the eavesdropper's side seems unrealistic. Hence, the classifier should be trained on a combination of sequences altered with different embedding rates. As illustrated in Table 4, the shortage of information about the embedding rate has not affected the reliability of the detectors. Surprisingly, TAR1 is blindly detectable by the proposed features. In other words, our detector can distinguish clean media from media manipulated using TAR1 to a high extent, without requiring partial information. Likewise, it achieves dramatic detection rates against TAR2-4. Compared to our approach, AoSO and MVRB have shown to be too weak to trace TAR2-3, with their reliability hovering around 0.53.
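The pooled training used in Setups 2 and 3 can be sketched as follows: feature vectors from every combination of compression and embedding settings are merged into a single training set, so the classifier does not rely on knowing the ME algorithm or embedding rate at test time. This is an illustrative sketch only; `pooled_training_set` and `toy_loader` are hypothetical names, and a real loader would extract the paper's 54-dimensional features from decoded sequences.

```python
from itertools import product

def pooled_training_set(load_features, me_algos, rates):
    # Pool feature vectors across every (ME algorithm, embedding rate)
    # combination so that the trained classifier does not depend on
    # knowing either setting for a suspicious sequence.
    X, y = [], []
    for me, er in product(me_algos, rates):
        for label in (0, 1):  # 0 = cover, 1 = stego
            for feat in load_features(me, er, label):
                X.append(feat)
                y.append(label)
    return X, y

# Stand-in loader: one 3-dimensional feature vector per condition.
def toy_loader(me, er, label):
    return [[float(me == "FULL"), er, float(label)]]

X, y = pooled_training_set(toy_loader, ["FULL", "HEX"], [0.1, 0.3])
```

With two ME algorithms, two rates, and one vector per class and condition, the pooled set holds eight samples, four per class.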
Table 3 Detector reliability of the proposed blind steganalyzer (MVST), NPELO, AoSO, and MVRB against TAR1-4, using Setup2 with different embedding rates (ER) and quantization parameters (QP). The best steganalysis result for each steganography setting is shown in bold.
Overall, a noticeable superiority of our method's results over those of the rivals can be observed.
(iv) Setup 4: In the preceding setups, we treated all detectors as targeted steganalysis methods. In this setup, we attempt to figure out whether the proposed method and its opponents can be regarded as universal steganalysis methods. Table 5, in which the best steganalysis result for each steganography setting is shown in bold, suggests that the performance of the rivals is considerably degraded when prior knowledge about the embedding algorithm is not accessible, whereas all of the steganography methods are vulnerable to our proposed features. The figures for all targets imply that these outstanding steganography methods can be easily trapped by the proposed method.

Conclusion
In this paper, using spatio-temporal statistics of motion vectors, we have proposed, implemented, and evaluated a blind steganalysis method for detecting MV-based steganography algorithms. The proposed approach is designed to boost performance in MV-based video steganalysis by addressing the shortcomings of previous approaches. Indeed, in contrast to previous methods, the proposed method is (i) capable of jointly utilizing the spatio-temporal statistics of the MVs to improve the detection accuracy, (ii) capable of capturing subtle statistical clues about MV-based steganography by considering the video codec configuration in the feature extraction stage, (iii) generalized to different video codec configurations, namely variable-block-size and sub-pixel motion estimation, and (iv) less vulnerable to overfitting than some rival methods, owing to the low dimensionality of the features achieved by the dimensionality-reduction stage.
Experimental results have shown that the proposed features' performance surpasses that of the prior outstanding MV-based steganalysis schemes. What sets our feature extraction method apart from previously proposed ones is its adaptability to various video compression settings and algorithms. On top of that, the proposed features perform relatively stably under different conditions, including different steganography methods, ME algorithms, quantization parameters, and even low embedding rates.
In recent video compression standards, Lagrangian-based cost functions are applied to increase compression efficiency. These functions select the best MV based on the number of bits needed to transmit it and the SAD between the current block and the reference block. The weight of the MV's code length in the Lagrangian cost is directly proportional to the Lagrangian multiplier. Therefore, a greater Lagrangian multiplier draws each MV toward its reference MVs and consequently produces stronger correlations among MVs; it can thus be inferred that applying smaller quantization parameters yields more resistance against the spatial features. Owing to exploiting joint spatio-temporal features, the proposed method has reached the best detection reliability compared to its rivals. In real-world conditions, where no information about the steganography algorithm, motion estimation method, or embedding rate is available, the proposed features have shown very promising detection results (more than 95%, compared to only 65% for the best rival). Our approach has shown a remarkable improvement in blind MV-based steganalysis, so that the prominent MV-based steganography methods are no longer reliable.
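The Lagrangian trade-off described above can be illustrated with a toy example: the encoder minimizes J = SAD + λ·R(mvd), where R is the bit cost of signaling the MV difference from its predictor. All names and numbers below are hypothetical, and `mv_bits` is a simplified Exp-Golomb-style estimate rather than a real codec's entropy-coded bit count.

```python
def mv_bits(mvd_component):
    # Simplified Exp-Golomb-style estimate: MV differences farther
    # from the predictor need more bits to signal.
    v = abs(mvd_component)
    bits = 1
    while v > 0:
        bits += 2
        v >>= 1
    return bits

def lagrangian_cost(sad, mv, pred_mv, lam):
    # J = SAD + lambda * R(mv - pred_mv)
    rate = mv_bits(mv[0] - pred_mv[0]) + mv_bits(mv[1] - pred_mv[1])
    return sad + lam * rate

# Two candidate MVs: one with lower SAD but far from the predictor,
# one with slightly higher SAD but cheap to signal.
candidates = [((5, 3), 180), ((1, 0), 200)]  # (mv, SAD) pairs
pred = (0, 0)

# Small lambda (low QP): the lowest-SAD candidate wins.
best_low = min(candidates, key=lambda c: lagrangian_cost(c[1], c[0], pred, lam=0.5))
# Large lambda (high QP): the rate term dominates, pulling the
# chosen MV toward the predictor and correlating neighboring MVs.
best_high = min(candidates, key=lambda c: lagrangian_cost(c[1], c[0], pred, lam=10.0))
```

This mirrors the effect exploited by the spatial features: larger quantization parameters (larger λ) push each MV toward its predicted value, strengthening inter-MV correlations that embedding then disturbs.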
For future work, we aim to consider scenarios where the confidential information is embedded only in the MVs belonging to deformable objects, using object detection and tracking approaches [8][9][10]. Tracking these objects is specifically important, since embedding confidential information in their corresponding MVs leaves less evidence of steganography.
Funding Open access funding provided by University of Klagenfurt.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.