1 Introduction

Video steganography embeds a secret message into an ordinary, innocent-looking cover digital video. The rapid growth of multimedia applications that communicate compressed videos between parties, such as internet video streaming, video telephony, and video conferencing, makes compressed video streams an attractive choice for steganography [1, 2]. By hiding the secret message in a compressed video, a large hiding capacity can be achieved while it remains hard to prove the presence of the message [3].

H.264, standardized as the Advanced Video Codec (AVC) [4], is the most widely used video codec [5]. Despite being superseded by H.265 [6], H.264 is still supported by more than 65% of security monitoring devices [7]. Regarding software, recent statistics show that 67% of network videos use the H.264 codec [7]. Moreover, H.264 is one of only three video codec choices available for YouTube live streaming [8]. H.264 compressed videos contain various possible hiding spots within the structure of the compression scheme [2]. For example, within the H.264 compression scheme, many coding stages can be used as hiding spots, such as the coded DCT coefficients [9,10,11,12], some flexible coding parameters chosen by the encoder (such as macro-block partitioning and quantization parameters) [13,14,15], the entropy coding stage [16, 17], and the motion estimation stage [18,19,20,21,22,23], where embedding is performed by introducing modifications to the motion vectors (MVs) according to the secret message. The modifications are applied to MV attributes, such as magnitude and phase angle, either to every MV (for example, using LSB embedding) or selectively to specific candidate motion vectors (CMVs) that satisfy certain criteria.

MV-based steganographic approaches have attracted researchers because they have little impact on the quality of the reconstructed frames and on the statistical properties of the coefficients. In other words, the statistical properties of the frames’ spatial/frequency coefficients do not change significantly after embedding. Thus, unlike steganographic schemes that modify spatial/frequency coefficients directly, MV-based steganography is harder to detect [24]. However, existing MV-based steganographic approaches for the H.264 video codec fail to satisfy the real-time constraints imposed by several emerging applications such as live streaming or conferencing. As we indicate shortly, the embedding process in these approaches either involves complex calculations or requires the entire frame or group of pictures (GOP) to be available before embedding. According to [25], MV-based steganographic approaches can be classified into three generations.

The first-generation MV-based steganographic approaches share the basic idea of choosing CMVs for modification based on predetermined selection rules, such that modifying them by predetermined methods (such as LSB) does not add much distortion to the compressed video. For example, the approach in [23] marks fast MVs as the CMVs based on the assumption that fast MVs (with magnitude larger than a threshold) are more likely to be erroneous and their associated macro-block prediction error (MBPE) is expected to be large; thus, modifying such CMVs does not add much distortion. Selecting CMVs based on their magnitude allows both encoder and decoder to agree on a fixed threshold. Therefore, the approach involves only simple calculations for embedding, which can be implemented in real time. However, as indicated in [18], the relation between the magnitude of an MV and its associated MBPE does not always hold. A more accurate distortion-related selection criterion was proposed in [18], in which the CMVs are selected if their associated MBPE (measured by PSNR [26]) is higher than a threshold. The embedding stage alters the CMVs to embed the secret message. Because the MBPE undergoes lossy compression, a compression-decompression step is performed at the encoder to determine whether the MBPE is still larger than the threshold at the decoder. If the associated MBPE of an altered CMV in a frame falls below the threshold after lossy compression-decompression, the embedding stage decreases the threshold and starts over for the whole frame. Thus, this approach involves complex calculations associated with the iterative threshold search, which limits its applicability to real-time steganography. The primary disadvantage of these first-generation MV-based steganographic approaches is that their selection criteria fail to maintain the statistical properties of the MVs. Specifically, the original MVs should be coherent, i.e., neighboring MVs tend to have the same magnitude and direction. However, due to the changes introduced by the embedding, the MVs lose coherency and become susceptible to some primitive targeted steganalysis techniques [27, 28]. For example, in [27], the embedding procedure is modeled as an additive operation on the cover video signal with an independent noise signal added to the X and Y components of the MVs. The statistical evaluation of the spatial and temporal correlations between MVs can then reveal the presence of hidden data. Another example is [28], in which the relation between the received and the recompressed (decompressed and compressed again) video streams is used to create a 15-dimensional motion-vector-reversion-based feature set for steganalysis to indicate the presence of hidden messages.
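To make the first-generation idea concrete, the following sketch illustrates magnitude-threshold CMV selection combined with LSB embedding in an MV component. It is a conceptual illustration only; the threshold value, bit allocation, and MV representation are our own assumptions and not the exact procedures of [23] or [18].

```python
# Conceptual sketch of first-generation CMV selection + LSB embedding.
# The threshold, bit allocation and MV representation are illustrative assumptions.
import math

def select_cmvs(mvs, threshold=8.0):
    """Return indices of candidate MVs whose magnitude exceeds the threshold."""
    return [i for i, (mvx, mvy) in enumerate(mvs)
            if math.hypot(mvx, mvy) > threshold]

def lsb_embed(mvs, message_bits, threshold=8.0):
    """Embed one message bit into the LSB of the x-component of each CMV."""
    stego = list(mvs)
    bits = iter(message_bits)
    for i in select_cmvs(mvs, threshold):
        try:
            bit = next(bits)
        except StopIteration:
            break                           # message exhausted
        mvx, mvy = stego[i]
        stego[i] = ((mvx & ~1) | bit, mvy)  # overwrite the LSB of MVx
    return stego

# Only the fast MV (magnitude > threshold) carries a message bit.
print(lsb_embed([(1, 0), (12, -9), (0, 2)], [1, 0, 1]))
```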

The second-generation MV-based steganographic methods utilize another framework based on minimizing the embedding distortion. As steganographic undetectability and embedding rate (bits per cover element) are two conflicting objectives for any steganographic scheme, Wet Paper Codes [29] and Syndrome-Trellis Codes (STC) [30] provide a theoretically proven optimization framework mitigating this conflict. Accordingly, [31, 32] utilize the frameworks in [29, 30], respectively, to minimize the overall embedding distortion. However, since the frameworks in [29, 30] require the whole cover, or at least a large portion of it (typically \(10^6\) cover elements as described in [30]), to be present before performing their optimization calculations, they are not suitable for real-time MV-based steganographic applications. Moreover, since ME algorithms choose the locally optimal MV, i.e., the MV corresponding to the least cost among all its neighbors, any small change to an MV will violate local optimality with high probability. Based on this observation, the authors in [33] used the SAD (Sum of Absolute Differences) distortion measure to determine the probability of MVs being locally optimal and proposed the Add-or-Subtract-One steganalysis technique with a 9-dimensional feature set.

The third-generation MV-based steganographic schemes utilize two basic ideas to defeat the Add-or-Subtract-One technique by attacking its SAD-based distortion features. The first idea is based on the fact that the SAD-based features are affected by lossy compression such as quantization [34]. Accordingly, if the steganography process chooses a non-locally-optimal MV, the distortion introduced to the DCT components by the lossy quantization step will, with high probability, not preserve the evidence of that non-locally-optimal choice [34]. Thus, after the inverse quantization process (Footnote 1), the reconstructed 8-neighbors around the stego MV at the decoder side will have almost the same SAD-based cost. Hence, the local optimality feature is preserved with high probability. The second idea utilizes the STC technique to further minimize the local-optimality-based SAD distortion [34, 35]. Thus, the steganographic schemes in [34, 35] could overcome the Add-or-Subtract-One steganalyzer.

Nonetheless, a steganalysis scheme called Near-Perfect Estimation for Local Optimality (NPELO) was proposed in [25] to address this SAD shortcoming by utilizing an additional distortion measure, the SATD (Sum of Absolute Transformed Differences), which neutralizes the effect of the lossy compression since SATD takes the quantization step into account. The reader can find details about the difference between SAD and SATD in [36]. Accordingly, the NPELO method [25] provides a 36-dimensional feature set (18 for SAD and 18 for SATD) to overcome the SAD shortcoming and attack the steganographic schemes in [34, 35]. An enhanced steganographic technique was proposed in [37] to overcome detection by the NPELO method [25] by considering SATD features in the design of the cost function. By combining STC with this cost function, the scheme in [37] increases security against the NPELO method.

The third-generation steganographic schemes have a major drawback: they do not consider a newer feature called Motion Vector Consistency (MVC) [38]. The MVC feature is based on the observation that, for any video codec, the MVs of the sub-MBs in the same MB are, with high probability, different from one another. Hence, this additional detection feature should be considered besides local optimality and coherency when designing secure steganographic techniques. An enhanced steganographic scheme was proposed in [39] to overcome both the local optimality and the MVC features. However, as it uses STC, which requires the cover length to exceed \(10^6\) elements (MVs in our case) for reliable coding performance [30], it is unsuitable for real-time applications.

This paper proposes an enhanced third-generation MV-based video steganography approach designed to defeat steganalysis built upon the local optimality, MV coherency, and MV consistency features while performing the embedding/extraction in real time. The proposed technique achieves real-time performance by operating on a per macro-block (MB) basis, eliminating the need to wait for the entire frame or GOP and avoiding additional re-coding steps. Furthermore, the alteration of motion vectors (MVs) for embedding occurs during the motion estimation (ME) sub-pixel-refinement stage by employing a rule-based strategy that ensures each MB is compatible for embedding. The compatibility is verified with respect to the local optimality, coherency, and consistency of the MVs. Moreover, the proposed technique achieves a relatively higher embedding rate compared with other steganographic schemes that attack the same steganalysis features.

It is worth mentioning that a recent video steganography category has emerged due to the rapid development of deep learning, namely steganography techniques based on generative adversarial networks (GANs). This category has become increasingly attractive to researchers in both the image [40] and video [41] domains. These approaches are end-to-end and do not require manually designing or adaptively selecting features for information hiding. They train the embedding and extraction processes simultaneously while a third adversarial network plays the role of a steganalyzer. Despite their recent success, these approaches still lack some crucial aspects. First, the adversary is trained for a specific steganalysis technique; thus, it does not generalize easily to other techniques. Second, because videos typically undergo lossy encoding, these approaches are still less robust against video compression than the classical approaches discussed above, which are designed to align with the video encoder pipeline. Therefore, our work in this paper enhances the classical third-generation techniques discussed above.

The rest of the paper is organized as follows. The next section describes some essential preliminaries required for the rest of the paper. In Section 3, we present the proposed technique and its implementation in detail. Section 4 presents the experimental results of the proposed implementation. Finally, we conclude the paper and give some insights about the future work in Section 5.

We gather all acronyms in Table 1 for better readability of the paper.

Table 1 Acronyms used in the paper

2 Preliminaries

2.1 Motion estimation for H.264 encoder

Fig. 1

General block diagram for video encoder. (MC) Motion Compensation, (ME) Motion Estimation, (\(I_n\)) Uncoded MB, (\(P_E\)) Prediction Error, (FM) Frame Memory, (BS) Output Bitstream, (T) Linear transform (DCT), (\(T^{-1}\)) Inverse T, (Q) Quantization module, (\(Q^{-1}\)) Inverse Q, (MV) Motion Vector, (EntC) Entropy coder, (R) Residual Data (lossy compressed \(P_E\)), (P) Predicted MB calculated by ME and reconstructed by MC, (Search Window) Search window from reconstructed previously coded frame

Figure 1 describes the general coding steps of the H.264 video encoder [26]. The ME module estimates an MV for every MB in the current frame by searching a window in the previously reconstructed frame for the best match, i.e., the position with the minimum SAD (Sum of Absolute Differences) or SATD (Sum of Absolute Transformed Differences). The motion compensation (MC) module then uses the estimated MVs to reconstruct the predicted macro-block (P), and the prediction error \(P_E\) is calculated by subtracting P from the current MB (\(I_n\)). The \(P_E\) is then coded by the transform (T) stage (usually a linear transform such as the DCT), quantized by the quantization stage (Q), and fed to the entropy coder (EntC) module.
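In pseudocode, the per-MB inter-coding loop of Fig. 1 reads roughly as follows. This is a schematic sketch in which the module functions are placeholders passed in as callables (assuming array-like MB data), not the actual encoder interfaces.

```python
# Schematic per-MB inter-coding loop corresponding to Fig. 1. The callables
# (me, mc, T, Q, entropy_code) are placeholders for the encoder stages, and
# I_n and the predicted MB are assumed to be numpy-like arrays.
def encode_mb(I_n, search_window, me, mc, T, Q, entropy_code):
    mv = me(I_n, search_window)      # ME: best-match MV (minimum SAD/SATD)
    P = mc(mv, search_window)        # MC: predicted MB from the reference frame
    P_E = I_n - P                    # prediction error
    R = Q(T(P_E))                    # transform + quantization -> residual data R
    return entropy_code(mv, R)       # this MB's contribution to the bitstream BS
```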

The H.264 standard [4, 42, 43] introduces sub-pixel motion estimation as an optional quality-enhancement stage for video coding. After the integer-pixel ME completes, an additional refinement search is executed after performing sub-pixel (half- or quarter-pixel) interpolation. Figure 2 describes this procedure, where the cost, which can be SAD or SATD, is calculated for the full-pixel (F-Pel), half-pixel (H-Pel), and quarter-pixel (Q-Pel) search setups. This cost is denoted by \(C_X\), where X is the specific MV search setup, and the encoder chooses between F-Pel, H-Pel, and Q-Pel according to the minimum value of \(C_X\). As shown in the figure, if \(C_{H\text{-}Pel} < C_{F\text{-}Pel}\), the encoder sets the new search center at the H-Pel position; otherwise, the search center remains unchanged. The Q-Pel search is then performed around the selected search center. If \(C_{Q\text{-}Pel}\) is smaller than the cost of the selected search center, the encoder uses the Q-Pel result as the final MV; otherwise, the final result is the search center (H-Pel or F-Pel). As we discuss shortly, we exploit this sub-pixel refinement stage in designing our steganography approach by introducing some modifications to it.
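The refinement decision described above can be summarized by the following sketch, assuming a generic cost() function (SAD or SATD) and externally supplied candidate positions; the function and parameter names are illustrative, not the encoder's actual API.

```python
# Illustrative sketch of the sub-pixel refinement decision in Fig. 2.
# cost() stands for SAD or SATD; positions are MV candidates (assumed inputs).
def sub_pixel_refine(cost, f_pel, h_pel_candidates, q_pel_candidates_around):
    center = f_pel
    best_h = min(h_pel_candidates, key=cost)
    if cost(best_h) < cost(center):          # C_H-Pel < C_F-Pel
        center = best_h                      # move the search center to the H-Pel
    best_q = min(q_pel_candidates_around(center), key=cost)
    if cost(best_q) < cost(center):          # C_Q-Pel beats the search center
        return best_q                        # use the Q-Pel result
    return center                            # keep the F-Pel or H-Pel result
```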

Fig. 2

Flowchart for the original Sub-Pixel refinement process

2.2 Coherency and consistency of MVs

The coherency of MVs refers to the tendency of neighboring MBs’ MVs to have the same magnitude and direction, as shown in Fig. 3. In other words, the differences between neighboring MVs tend to be zero-mean distributed [37]. Several steganalysis approaches detect MV-based steganography by checking the coherency of the neighboring MVs. For example, the method in [27] utilizes 12 features based on the coherency of the MVs and builds a Support Vector Machine (SVM) classifier to detect MV-based steganography for MPEG-2 videos [44].

On the other hand, the consistency of MVs is a concept introduced in [38] based on the observation that the H.264 encoder decides either to keep the larger block or to divide it into smaller sub-MBs with different MVs after comparing the costs of the two cases and choosing the one with minimum cost. Thus, if the encoder decides to divide a larger block, the resulting neighboring sub-MBs will, with high probability, have non-coherent MVs. Conversely, if the sub-MBs within the same larger block carry identical MVs, this can be treated as evidence of a steganographic modification, since the cost of using a single MV for one large block would be smaller than repeating the same MV multiple times (once per sub-MB).

Thus, the concept of MV consistency reflects the non-coherent nature of the sub-MBs’ MVs in regular (non-modified) original videos, as shown in Fig. 3. Therefore, MV consistency can be used to detect MV-based steganography [38]. The two features, coherency and consistency, can be used together for enhanced steganographic detection, as coherency applies only to MBs of type \(16\times 16\), while consistency applies only to sub-MBs.
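As an illustration of how a steganalyzer might quantify these two properties, consider the following simplified sketch; the feature definitions here are rough stand-ins for the full feature sets of [27] and [38], and the function names are ours.

```python
import numpy as np

def coherency_score(mv_field):
    """Mean magnitude of the differences between horizontally and vertically
    neighboring MVs; low values indicate a coherent motion field."""
    mv = np.asarray(mv_field, dtype=float)              # shape: (rows, cols, 2)
    dh = np.linalg.norm(np.diff(mv, axis=1), axis=-1)   # horizontal neighbors
    dv = np.linalg.norm(np.diff(mv, axis=0), axis=-1)   # vertical neighbors
    return float(np.concatenate([dh.ravel(), dv.ravel()]).mean())

def consistency_suspicious(sub_mb_mvs):
    """Flag an MB whose sub-MBs all carry identical MVs: a rational encoder
    would have kept the larger partition instead (Section 2.2)."""
    return len(sub_mb_mvs) > 1 and len(set(map(tuple, sub_mb_mvs))) == 1
```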

Fig. 3

Example illustrating the consistency and coherency of MVs

2.3 Local optimality

ME algorithms choose the MV corresponding to the least cost among all its neighbors because it corresponds to the smallest number of bits required to represent the MV and its corresponding error signal, calculated by SAD or SATD. Thus, the MV should be locally optimal with respect to its neighbors [33]. When a steganographic technique modifies the MVs for embedding, there is a high probability that the modified MVs are no longer locally optimal, i.e., that another MV exists with a smaller cost.

Based on the local optimality of the MVs, the authors in [33] proposed an algorithm that calculates the SAD errors for the 8-neighbors surrounding each MV. If any of the eight neighbors achieves a SAD error lower than that of the initially received MV, this is treated as an indication of steganographic modification. The authors designed a 9-dimensional feature set based on local optimality and trained an SVM for prediction. Their method was extended in [25] with a 36-dimensional feature set for better accuracy.
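A minimal sketch of this local-optimality test, the core of the 9-dimensional feature set in [33], is given below; sad_cost is an assumed helper that returns the matching error of the block at a given candidate MV.

```python
# Sketch of the local-optimality test used in [33]: the received MV should have
# a cost no larger than any of its 8 neighbors; otherwise it is flagged as a
# possible stego modification. sad_cost is an assumed helper returning the SAD
# (or SATD) matching error of the block at a candidate MV.
NEIGHBOR_OFFSETS = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                    if (dx, dy) != (0, 0)]

def is_locally_optimal(mv, sad_cost):
    mvx, mvy = mv
    center_cost = sad_cost((mvx, mvy))
    return all(sad_cost((mvx + dx, mvy + dy)) >= center_cost
               for dx, dy in NEIGHBOR_OFFSETS)
```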

3 Proposed video steganography approach and its implementation

In this section, we present our proposed video steganography technique, designed to defeat steganalysis built upon the local optimality, MV coherency, and MV consistency criteria while performing the embedding/extraction in real time. The proposed technique satisfies these constraints by operating per MB in the ME sub-pixel-refinement stage, without waiting for the whole frame or GOP and without performing any additional ME or re-coding step. For each MB, the proposed technique checks whether the MB is suitable for embedding through our designed MB compatibility criteria. The compatibility of the MB for modification is carefully checked to ensure that steganalysis methods built on the MV local optimality, coherency, and consistency features do not detect the embedding. In the following, we present the proposed technique followed by its implementation details.

3.1 Macro-blocks compatibility criteria

Fig. 4

Flowchart for checking the compatibility of MBs for embedding. This compatibility check is a modified version of the original sub-pixel refinement process, so it can be integrated directly into the encoder and run in real time, as it operates on an MB basis

The proposed technique operates on an MB basis. For each MB, the proposed technique checks its compatibility for embedding; the embedding is then performed by modifying the MV of the MB according to the current message bit(s), denoted by Msg. The flowchart of this check is depicted in Fig. 4. As shown in the figure, all \(16\times 16\) MBs are treated as unsuitable for embedding by the proposed technique. In other words, the proposed technique embeds the Msg by modifying only the MVs associated with sub-MBs and introduces no MV-related changes to the \(16\times 16\) MBs. Consequently, neighboring \(16\times 16\) MBs retain their original MVs, and the output video stream of the proposed embedding technique neutralizes the coherency features.

Fig. 5

Flowchart for the embedding process at the encoder (i.e. Q-Pel-Emb-mod(Msg) in Fig. 4)

For all sub-MBs, we first check whether the MV corresponding to the sub-MB at the H-Pel is locally optimal. If so, the sub-pixel refinement process performs the Q-Pel search as usual and examines its output. If the MV corresponding to the Q-Pel is locally optimal and its value matches the Msg, the encoder uses the Q-Pel without any modification (indicated as Q-Pel-done-embd in the figure). This way, we make sure that local optimality is preserved. On the other hand, if the value of the MV at this Q-Pel does not match the Msg, the encoder discards the Q-Pel search result and returns to the H-Pel or F-Pel even if the Q-Pel is locally optimal, preventing the Q-Pel motion vector from being transmitted. Thus, the encoder uses the H-Pel (or F-Pel) in this case, and the MV related to this H-Pel (or F-Pel) is locally optimal with respect to the selected pixel resolution. As shown in the figure, the Msg is embedded by altering an MV in a single case, indicated by the Q-Pel-Emb-mod(Msg) block. This happens when there is no locally optimal MV in the Q-Pel neighborhood of the sub-MB. Thus, when any MV is selected from the neighborhood to embed the Msg, this modification will not be detected by consistency-, coherency-, or local-optimality-based steganalyzers.
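The per-sub-MB decision flow of Fig. 4 can be sketched as follows. This is a simplified, illustrative rendering under our own naming: the callables stand in for the flowchart blocks and are not actual OpenH264 functions, and \(16\times 16\) MBs bypass this routine entirely, as described above.

```python
# Simplified rendering of the per-sub-MB decision flow of Fig. 4. The four
# callables stand in for the corresponding flowchart blocks; they are
# placeholders, not actual OpenH264 functions. Returns (chosen MV, embedded?).
def process_sub_mb(center, msg_bits,
                   locally_optimal, q_pel_search, mv_matches, q_pel_emb_mod):
    if not locally_optimal(center):
        return center, False                      # not compatible: no embedding
    q_best = q_pel_search(center)                 # regular Q-Pel refinement
    if locally_optimal(q_best):
        if mv_matches(q_best, msg_bits):
            return q_best, True                   # "Q-Pel-done-embd": no change needed
        return center, False                      # discard Q-Pel result, no embedding
    # no locally optimal MV in the Q-Pel neighborhood: modify the MV to carry Msg
    return q_pel_emb_mod(center, msg_bits), True  # "Q-Pel-Emb-mod(Msg)"
```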

The detailed operation of the Q-Pel-Emb-mod(Msg) block in Fig. 4 is described in Fig. 5. Before presenting it, let class-1 sub-MBs denote the sub-MBs that result from dividing a larger MB into four parts (\(8\times 8\) or \(4\times 4\) sub-MB types), and let class-2 sub-MBs denote the sub-MBs that result from dividing a larger MB into two parts (\(16\times 8\), \(8\times 16\), \(8\times 4\), or \(4\times 8\) sub-MB types). Both classes are illustrated in Fig. 6.

To avoid being detected by the consistency features, we need to maintain the randomness of the sub-MBs’ MVs. To do so, we embed unconditionally in class-1 sub-MBs without worrying about the consistency of their MVs, while we embed only in the first sub-MB of class-2 sub-MBs. The reason is that the message data is usually modeled as a uniform random sequence (Footnote 2) [45]. Thus, the probability \(P_{c2}\) of embedding the same message symbol the required number of consecutive times in class-2 sub-MBs is higher than the corresponding probability \(P_{c1}\) for class-1 sub-MBs; specifically, \(P_{c1} = P_{c2}^{3}\). Hence, in the case where each modified MV holds a 2-bit symbol of the Msg, the probability of obtaining an identical symbol for all child sub-MBs (i.e., two consecutive times for class-2 and four consecutive times for class-1) is 0.25 for class-2 and \(0.25^3\) for class-1. See Table 2 for other embedding settings. Note that, in practical scenarios, the repeated symbols may be distributed across non-neighboring sub-MBs; thus, the probabilities listed in Table 2 can be considered an upper limit.
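The probabilities discussed above follow directly from modeling the message as uniform random symbols; the short calculation below illustrates how they are obtained for 1-, 2-, and 3-bit symbols (assuming these are the embedding settings considered in Table 2).

```python
# Probability that all sub-MBs of one MB receive the same random symbol.
# class-2: 2 sub-MBs -> 1 extra symbol must match; class-1: 4 sub-MBs -> 3 must match.
for bits in (1, 2, 3):
    p_symbol = 1 / 2 ** bits            # probability of matching a given symbol
    p_c2 = p_symbol ** (2 - 1)
    p_c1 = p_symbol ** (4 - 1)          # = p_c2 ** 3
    print(f"{bits}-bit symbols: class-2 = {p_c2:.6f}, class-1 = {p_c1:.6f}")
```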

Fig. 6

Different Macro-Block partitioning modes for inter-frame prediction. Class-1 sub-MBs are shaded while class-2 sub-MBs are not

When embedding in class-2 sub-MBs, we ensure by a simple check that the first sub-MB’s MV differs from the second one’s. If the MV of the second sub-MB matches the first one after the modification introduced by the embedding, we modify the second MV randomly. This check is essential to avoid the modified MV of the second sub-MB becoming identical to the first modified one, which would make it prone to detection by the consistency features. This procedure is represented by the KeepConsistency block.

After determining the compatibility of the sub-MB for embedding, the next step is to perform data embedding and extraction, which will be described in the following subsection.

Table 2 The probability of randomly getting the same embedding symbol for all sub-MBs of class-1 and class-2 sub-MBs for different embedding settings

3.2 Embedding and extraction

As indicated above, the embedding is performed by modifying the MV associated with every class-1 sub-MB and with the first sub-MB of class-2 sub-MBs. The modification is performed when there is no locally optimal MV in the Q-Pel neighborhood of the sub-MB, as indicated by the Q-Pel-Emb-mod(Msg) block in Fig. 4. We reach the Q-Pel-Emb-mod(Msg) block when either the F-Pel or the H-Pel is used as the refinement search center (indicated by the gray shaded rectangle in Fig. 4). The proposed technique modifies the MV according to (a) the Msg and (b) the refinement search center (F-Pel or H-Pel). The modification for the F-Pel, diagonal H-Pel, vertical H-Pel, and horizontal H-Pel refinement search centers is described in Fig. 8(a), (b), (c), and (d), respectively. As shown in the figure, the proposed embedding technique takes 3 bits as Msg and modifies the MV accordingly. For example, if the Msg is 001 and the refinement search center is the F-Pel, the MV is modified to point to the upper position. As another example, if the Msg is 111 and the refinement search center is one of the diagonal H-Pels, the chosen position is the right one.

As illustrated in Fig. 8, the embedding diagram differs depending on whether the search center is the F-Pel or one of the H-Pel positions (horizontal, vertical, or diagonal). To explain the rationale, let us consider the location designated by the shaded Q in Fig. 7 as an example. Since the decoder only receives the MV pointing to the location marked by the shaded Q, this location can be interpreted at the decoder side in four different ways:

  • Upper right direction w.r.t the F-Pel.

  • Lower left direction w.r.t the upper right diagonal H-Pel.

  • Lower right direction w.r.t the upper vertical H-Pel.

  • Upper left direction w.r.t the right horizontal H-Pel.

This means that three H-Pel positions, in addition to the F-Pel, can serve as the search center for a single received position, which creates ambiguity at the decoder. To remove this ambiguity and allow uniquely decodable embedding, we use a different embedding diagram for each search center, ensuring that all these possibilities map to the same Msg. Thus, when the shaded Q in Fig. 7 is mapped according to the rules in Fig. 8, the result (shown in Fig. 9) is the symbol 010 for all four possibilities described above. Hence, the position of the shaded Q in Fig. 7 is uniquely decodable to the symbol 010.

Fig. 7

Sub-pixel refinement scheme after interpolation. Symbols F, H and Q denote F-Pel, H-Pel, Q-Pel respectively

Fig. 8

Embedding scheme for (a) F-Pel, (b) diagonal H-Pel, (c) vertical H-Pel, (d) horizontal H-Pel

Fig. 9

Actual embedding diagram according to Figs. 7 and 8

The decoding process is straightforward. For a received MV with MVx and MVy denoting its X- and Y-components (in quarter-pixel units after interpolation), the decoder first calculates mod(MVx,4) and mod(MVy,4) and then directly extracts the Msg by using these values as coordinates into the table in Fig. 10. For example, consider an MV with components (27, 35). The coordinates of the extracted message are (27 mod 4, 35 mod 4) = (3, 3); these coordinates index the table in Fig. 10, and hence the message is 000. Note that the entries denoted by X in the table in Fig. 10 represent the F-Pel and H-Pel search centers, which cannot be used for embedding.
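As a sketch, the extraction step amounts to a modulo-4 table lookup. In the snippet below, the table is a placeholder holding only the entry confirmed by the worked example above; the complete mapping is the one given in Fig. 10.

```python
# Sketch of message extraction at the decoder: a modulo-4 lookup into the table
# of Fig. 10. EXTRACTION_TABLE below is a placeholder: only the entry confirmed
# by the worked example in the text is filled in; the full table is in Fig. 10.
EXTRACTION_TABLE = {(3, 3): "000"}       # (MVx mod 4, MVy mod 4) -> 3-bit symbol

def extract_symbol(mvx, mvy, table=EXTRACTION_TABLE):
    key = (mvx % 4, mvy % 4)
    return table.get(key)                # None for 'X' (search-center) or unknown entries

print(extract_symbol(27, 35))            # -> "000", as in the worked example
```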

The following subsection presents the implementation of our embedding and extraction schemes using an open-source implementation (OpenH264 [46]) of the H.264 codec.

3.3 Implementation of the proposed video steganography within OpenH264

We have implemented the proposed technique using the OpenH264 [46] software encoder for H.264. The OpenH264 encoder achieves real-time performance because it exploits the SIMD (Footnote 3) instruction sets of the x86 and ARM architectures. However, the implementation includes several optimizations and simplifications, such as:

  1. The ME algorithm used is the Diamond search algorithm [47].

  2. The encoder does not support bi-directional frames (only P-frames are supported).

  3. For the ME sub-pixel-refinement stage, only four positions are checked. These positions are marked as (F, H, Q) in Fig. 11. In other words, the OpenH264 ME sub-pixel-refinement stage checks only four positions (up, down, right, and left, with no diagonal positions checked).

As OpenH264 uses only four search positions in both the H-Pel and Q-Pel stages, we had to adapt the generalized technique of the previous subsections for implementation inside the OpenH264 encoder. Accordingly, the modified embedding technique embeds 2 bits per MV instead of 3 bits, and only two mapping diagrams are utilized instead of four, as described in Fig. 12. The decoder side works as in the generalized technique, except that the extraction table is replaced by the table in Fig. 13.

4 Experimental results

This section compares the proposed technique with the MVMPLO [35] approach. The comparison focuses on the detectability of both approaches by the local-optimality-based NPELO technique [25], the coherency-based detector [27], and the MV-consistency-based detector [38]. The MVMPLO approach was chosen for comparison among other steganographic methods because it achieves the best security performance against local optimality and consistency analyzers according to [25] and [38]. Additionally, we provide another comparison with the methods proposed in [39].

This section is organized as follows. First, we present our experimental setup, including the video dataset, the steganalyzers used in our experiments, and the metrics used to measure the performance. Then, we present our experimental comparison with the MVMPLO [35] method. Finally, we present our experimental comparison with the methods in [39].

Fig. 10

The extraction table

Fig. 11

OpenH264 sub-pixel refinement scheme after interpolation. Symbols F, H and Q denote F-Pel, H-Pel, Q-Pel, respectively. Sub-pixels denoted by x are not included in sub-pixel search

4.1 Experimental setup

The video dataset used in our comparison contains 44 videos from [48, 49]. The dataset contains 5, 11, and 28 videos with 1080p, 720p, and CIF (\(352\times 288\)) resolutions, respectively, and covers a large diversity of motion dynamics. To show this, we compute the Motion Activity Index [50] (MAI) for each video in the dataset. The MAI is a number between 1 and 5 that describes the dynamics of the video, where 1 implies very low dynamics and 5 implies very high dynamics. A summary of our dataset is given in Table 3.

The proposed technique is implemented and integrated into the OpenH264 [46] real-time video encoder. We have utilized the MVMPLO implementation in [38] with the x264 [51] video encoder.

All the steganalysis methods used in our experiments first extract features from the videos and then employ these features to build a classifier that labels a video as cover (original) or stego (containing hidden data). For the local-optimality-based steganalyzer in [25], we utilized the feature-extraction tool [52] implemented by the authors. For the consistency-based steganalyzer in [38], we also used the authors’ feature-extraction tool, which produces 12 features. We implemented the coherency-based steganalyzer [27] ourselves, as no implementation is available from the authors.

Fig. 12

The modified embedding scheme for (a) F-Pel and (b) H-Pel of OpenH264 encoder

Fig. 13

The modified extraction table of OpenH264

Table 3 Our video dataset

After feature extraction for each steganalysis method, we used the SVM tool in Matlab-2020a© to build an SVM classifier for each method. We used a 5-fold cross-validation procedure to determine the best SVM kernel and its best parameters. We found that the fine Gaussian kernel with a kernel-scale parameter of 1.5 achieves the best accuracy for the local-optimality-based steganalyzer [25], while the Gaussian kernel with a kernel-scale parameter of 0.87 achieves the best accuracy for the consistency-based steganalyzer [38] and the coherency-based steganalyzer [27]. As recommended by each steganalysis method, the dataset is fed to the steganalyzers divided into GOP samples: 12 frames per GOP for the local-optimality-based and coherency-based steganalyzers and six frames per GOP for the consistency-based steganalyzer.

Note that the coherency-based steganalyzer [27] recommends an SVM with a linear kernel, but the original method was designed for MPEG-2, which does not support sub-MBs or skipped \(16\times 16\) MBs, unlike the H.264 codec used here. We experimentally found that the Gaussian kernel performs better than the linear kernel; thus, this kernel change favors the method’s performance.
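For reference, an analogous training setup in scikit-learn might look roughly like the following; the experiments themselves used the Matlab SVM tool, and Matlab's kernel-scale parameter does not map one-to-one onto scikit-learn's gamma, so the candidate values are purely illustrative.

```python
# Analogous (scikit-learn) sketch of the classifier training described above;
# the actual experiments used the SVM tool in Matlab-2020a.
# X: per-GOP feature vectors produced by a steganalyzer's extraction tool.
# y: 0 for cover GOPs, 1 for stego GOPs.
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_steganalyzer(X, y):
    pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    # 5-fold cross-validation over candidate kernel parameters (illustrative values).
    grid = GridSearchCV(pipe, {"svc__gamma": [0.1, 0.5, 1.0, 2.0, 5.0]}, cv=5)
    grid.fit(X, y)
    return grid.best_estimator_
```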

As indicated, the proposed steganography scheme is implemented within the OpenH264 encoder, whereas MVMPLO uses the x264 encoder. However, the two implementations follow the same H.264 standard and, when fed the same settings, should produce the same encoded videos. Thus, to ensure a fair comparison, we performed the following:

  1. We applied the same H.264 configuration settings for all encoders in all experiments and used the same video dataset.

  2. We embedded the same number of bits with both MVMPLO and the proposed technique in our comparison against the steganalyzers (see Fig. 14 below).

  3. We trained two versions of each steganalysis method separately, one for each steganography technique. For each steganography scheme (MVMPLO and the proposed one) and steganalysis method, we used identical video encoder configurations to obtain the output videos (without embedding). Then, we applied each steganography scheme to obtain the stego videos. We kept the training settings fixed across the two versions. Thus, we obtain six trained steganalyzers in total (two steganography techniques and three steganalysis methods).

  4. In our comparisons, as we indicate shortly, we used a relative performance metric (for PSNR and bit rate) to measure the difference in metrics between each technique’s original and stego versions.

A final note is that the MVMPLO approach uses the STC framework [30] to select the MVs for modification that minimize a cost function comprising the embedding distortion and accounting for the local optimality features, and hence it achieves minimal detectability with respect to the local optimality steganalyzer. STC requires processing more than \(10^6\) MVs to obtain reliable performance. Therefore, in our experiment, all MVs of each video are extracted first and then processed by STC at once to maximize its security against the steganalyzers (Footnote 4) and achieve the recommended reliable STC performance. For both steganography schemes, we set the quantization parameter (Footnote 5) (QP) to 25.

4.2 Comparison against the MVMPLO technique

This section compares the proposed technique and the MVMPLO scheme with respect to several metrics: the number of embedded bits per video, the detection performance of the steganalyzers used, the reduction in PSNR due to the embedding process, and the running-time overhead of each scheme.

Fig. 14

Number of embedded bits for each video in Table 3 for MVMPLO and the proposed technique

First, as our proposed technique uses rule-based MV selection, the number of embedded bits depends mainly on the video content. In other words, we cannot pre-determine the number of embedded bits in our scheme (this is discussed further in Section 4.4), unlike MVMPLO, which can have a pre-determined number of embedded bits as it can be pre-configured for a particular distortion. Thus, to provide a fair comparison between the two schemes, we first ran the proposed technique and calculated the number of embedded bits, and then embedded the same number of bits per video with MVMPLO. Figure 14 compares the number of embedded bits per video for the proposed technique and the MVMPLO scheme; both schemes embed almost the same number of bits per video.

$$\begin{aligned} \textrm{Recall} = \frac{T_P}{T_P + F_N}. \end{aligned}$$
(1)
$$\begin{aligned} \textrm{Prec} = \frac{T_P}{T_P + F_P}. \end{aligned}$$
(2)
$$\begin{aligned} \mathrm {F_1} = 2\times \frac{\textrm{Recall} \times \textrm{Prec}}{\textrm{Recall} + \textrm{Prec}}. \end{aligned}$$
(3)
$$\begin{aligned} \mathrm {A_{CC}} = \frac{T_P + T_N}{T_P + T_N + F_P + F_N}. \end{aligned}$$
(4)

Second, for each steganalyzer, we treat the features extracted from the original cover videos as negative class samples, while the positive class samples comprise the features of the stego videos. We compute the recall, precision, F1-score, and accuracy according to (1), (2), (3), and (4), respectively, given above. We summarize these values in Table 4. The table shows that the proposed technique achieves almost the same security performance as the MVMPLO approach in attacking the NPELO technique while greatly outperforming the MVMPLO approach in attacking the consistency-based steganalyzer [38]. Additionally, although both steganographic schemes neutralize the coherency steganalyzer [27], the proposed technique achieves better results as it keeps the \(16\times 16\) MBs intact, unlike the MVMPLO approach.
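For completeness, a small helper computing (1)-(4) from the raw confusion-matrix counts is shown below; the example counts are illustrative, not our experimental results.

```python
# Compute Recall, Precision, F1-score and Accuracy, i.e. (1)-(4), from raw counts.
def classification_metrics(tp, tn, fp, fn):
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * recall * precision / (recall + precision)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"recall": recall, "precision": precision,
            "f1": f1, "accuracy": accuracy}

print(classification_metrics(tp=40, tn=38, fp=6, fn=4))   # illustrative counts only
```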

Table 4 Results of steganalyzers against MVMPLO and the proposed technique
Fig. 15

Percentage bit rate variations (represented by \(\Delta _{100}\) score) of the proposed technique and MVMPLO

Both steganographic schemes achieve relatively high security performance in attacking the coherency steganalyzer due to a particular MB type in the H.264 encoder called the skipped \(16\times 16\) MB. The H.264 standard utilizes the skipped \(16\times 16\) MB type to achieve higher compression by excluding MBs with coherent neighbors from the output bitstream; instead, the decoder estimates these MBs from the surrounding MBs, which reduces the overall video size. However, this poses a significant challenge for MV-coherency steganalyzers when applied against steganographic methods implemented within the H.264 encoder that retain the motion vectors (MVs) associated with the skipped \(16\times 16\) MBs and operate only on the non-skipped \(16\times 16\) MBs. The reason is that the MVs of the non-skipped \(16\times 16\) MBs correspond to the typically irregular (incoherent) portion of the motion field within the video frame. Hence, this greatly reduces the detection performance of MV-coherency steganalysis methods like [27], as shown in Table 4.

It is also worth mentioning that the proposed technique performs slightly worse than MVMPLO in attacking the NPELO technique. This behavior can be explained by Fig. 4, which shows the flowchart for MB compatibility checking for embedding. The proposed technique may violate the local-optimality constraint when the value of the MV at a certain Q-Pel does not match the Msg: the encoder then discards the Q-Pel search result and returns to the H-Pel or F-Pel, which with some probability may not be locally optimal. In practice, however, this results only in a small performance drop, as shown in Table 4.

Fig. 16

Percentage PSNR variations (represented by \(\Delta _{100}\) score) of the proposed technique and MVMPLO

Third, we compare the proposed technique with the MVMPLO regarding the embedding effect on both the bit rate and PSNR. We demonstrate PSNR and bit rate variations between the cover and stego videos for our setup using the \(\Delta _{100}\) score described as

$$\begin{aligned} \Delta _{100} (x,y) = 100\times \frac{x-y}{y}. \end{aligned}$$
(5)

Using (5), we substitute y with the original cover value (PSNR or bit rate) and x with the corresponding stego value. We present this comparison in Figs. 15 and 16 for bit rate and PSNR, respectively. The figures show that the effect of the proposed technique on PSNR and bit rate is almost the same as that of MVMPLO.
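For instance, applying (5) to a hypothetical pair of cover/stego values behaves as follows; the numbers are for illustration only.

```python
# Delta_100 score of (5): percentage change of a stego-video metric x relative
# to its cover value y. The numbers below are hypothetical, for illustration only.
def delta_100(x, y):
    return 100.0 * (x - y) / y

print(delta_100(x=36.9, y=37.1))      # PSNR change of about -0.54 %
print(delta_100(x=1530.0, y=1500.0))  # bit rate change of +2.0 %
```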

Finally, we compare the time overhead of the proposed and MVMPLO schemes. The proposed technique adds only a small overhead (\(1-2\%\)) to the encoder running time, which indicates that it preserves the real-time performance of the encoder. In contrast, the MVMPLO scheme uses STC coding to minimize the embedding distortion. Using STC reliably requires processing more than \(10^6\) MVs to select the suitable MVs for modification [30], and the new residuals corresponding to the modified MVs must then be re-calculated; hence, it cannot meet the real-time constraint. In our experiment, all MVs (Footnote 6) of each video are extracted and processed by the STC encoder to maximize its security, ensuring a fair comparison with the proposed technique.

4.3 Additional comparative analysis

Here, we extend our experimental comparison by cloning the experimental setup of [39]. Specifically, we used the same dataset and H.264 coding parameters as in [39] and applied this setup to the proposed technique, which allows us to use the results reported in [39] to provide a further, more comprehensive comparison between the proposed technique and recent steganographic techniques in the literature.

As the real-time design of the proposed technique does not allow multiple predefined embedding rates (as indicated in the previous subsection), we have chosen the specific entries of Table 1 and Fig. 3 in [39] that match the average bpnsmv (bits per non-skipped motion vector) obtained by our proposed technique.

Table 5 provides a performance comparison between the proposed technique and five techniques listed in [39]. The methods Tar1, Tar2, and Tar3 in the table refer to [18], MVMPLO, and [32], respectively, while dMVC and dMVC+LO refer to the techniques proposed in [39]. The scores of the steganalyzers (NPELO [25] and MVC [38]) are represented in terms of the minimum average prediction error (as described in [39]), not the accuracy. Finally, \(\Delta Bitrate\) represents the percentage change of the bit rate due to embedding. It can be concluded from Table 5 that the proposed technique achieves acceptable security and performance margins while being the only one that preserves the real-time constraints. It should be noted that the coding performance (PSNR and \(\Delta Bitrate\)) of OpenH264 is slightly lower than that of JM 19.0 [54] (utilized in [39]) due to the optimizations described in Section 3.3.

Table 5 Comparison between the proposed technique against other techniques using Table 1 and Fig. 3 in [39]

4.4 Discussion

Generally speaking, two challenges are associated with achieving real-time performance in MV-based video steganography. First, selecting the best MVs to be modified is performed through a cost function that typically requires the whole frame or GOP, as in [31, 32, 34, 35, 39, 55]. Second, as MV-based video steganography techniques modify the MVs to insert the secret data, the modified MVs consequently require re-encoding to calculate the new DCT residuals; otherwise, the quality of the resulting videos is severely affected. Thus, performing MV-based video steganography in real time is challenging. To our knowledge, the only available solution that achieves real-time performance with MV-based video steganography is the scheme in [15]. That scheme exploits the partition modes (PMs) and establishes a mapping rule between message bits and the PMs that allows modifying the PMs according to the message bits. Accordingly, the scheme forces the choice of the partition modes during the ME process of selected frames according to the message bits, where a frame is selected for this modification only if it contains a scene change. However, as shown in [38], the modifications performed by this scheme are easily detected by the consistency steganalyzer, even for a very low embedding capacity (detection accuracy of \(93.83\%\) at 0.05 bits per MB). Although the proposed technique bears some similarity to the scheme in [15] in establishing a mapping rule between message bits and the modifications performed to the MVs, our scheme shows excellent performance against the consistency steganalyzer, thanks to our carefully designed compatibility criteria.

Achieving the real-time constraints with the proposed technique comes at a price: the proposed technique cannot pre-determine the number of embedded bits. To meet the real-time constraint, it performs the embedding on the fly, i.e., without waiting for the whole video or a large number of frames to be available. Specifically, the proposed technique operates per MB in the ME sub-pixel-refinement stage without waiting for the whole frame or GOP and without performing any additional ME or re-coding step. Therefore, we cannot set a specific number of bits to be embedded in advance. In comparison, other techniques in the literature can pre-determine the number of embedded bits because they either employ an optimization strategy to select the best modifications over several frames or a GOP to meet this pre-determined number of bits, or they re-perform the ME after the modifications. Despite the ability to commit to a certain number of bits, these techniques cannot perform such embedding strategies in real time due to the waiting time for several frames or the GOP.

Another point worth highlighting is that the proposed technique keeps the encoder’s most resource-consuming processes intact. Specifically, according to [56], ME and (sub-pixel) interpolation consume about \(68.7\%\) and \(18.6\%\), respectively, of the whole encoder resources, for a total of \(87.3\%\) for the complete ME process. Although the proposed approach operates within the ME and sub-pixel refinement stage, the embedding process affects only the final decision; no interaction with SAD, SATD, or the interpolation filters is performed. That is why the resulting resource overhead is only \(1-2\%\).

Finally, the proposed technique does not allow embedding in non-optimal MVs, as no error-coding scheme such as the STC framework is employed at the MB level; thus, it is more restrictive than MVMPLO. However, the proposed technique can embed up to 3 bits per MV, which relaxes its restrictive nature. As shown in Section 4.2, even when embedding 2 bits per MV, the proposed approach has an embedding capacity comparable to that of the MVMPLO approach, and this capacity is expected to increase when embedding 3 bits per MV instead of 2.

5 Conclusion

In this paper, a new approach for real-time video steganography has been introduced. The proposed technique integrates smoothly into the H.264 standard and achieves outstanding performance against state-of-the-art steganalysis techniques while maintaining real-time encoding constraints. The proposed technique embeds the secret message by altering the motion vectors (MVs) while preserving their local optimality, coherency, and consistency to withstand recently emerged steganalysis methods. The proposed technique is implemented within the OpenH264 encoder, and the experimental results demonstrate that it offers excellent performance in attacking local optimality, coherency, and consistency steganalyzers compared with state-of-the-art techniques. Specifically, the proposed technique significantly reduces the performance of the steganalyzers’ classifiers in terms of precision, recall, and accuracy. Additionally, the proposed technique performs the embedding in real time and adds only a small overhead (\(1-2\%\)) to the encoder running time.

In the future, we will consider an important practical application of the proposed technique. Specifically, we intend to implement the proposed technique as an extension of WebRTC [57]. This can add real-time steganography capabilities to video-communication solutions built on top of open web standards.