1 Introduction

The rapid and extensive progress of internet and networking technologies has simplified the replication, alteration, reproduction, and distribution of multimedia content through physical transmission media. This occurs during communication, information processing, and data storage, all at a low cost and without compromising the quality of the content. Therefore, securing data and protecting digital information against emerging hacker threats is essential. Different data hiding techniques have been proposed to address this problem, such as cryptography, steganography, and digital watermarking. The latter consists of embedding a signature into the original content and then trying to detect it after different manipulations have been applied to the marked content. Watermarking is used for several applications, such as content protection, copyright management, content authentication, and tamper detection. Figure 1 illustrates several widely recognized applications of watermarking.

Fig. 1 Watermarking applications

In the last two decades, many traditional watermarking approaches have been proposed to secure different types of media, such as image [84], 2D video [88], 3D models [20], and audio [36]. These traditional approaches embed signatures into feature regions using the spatial or frequency domain, and they have proven efficient in terms of invisibility and robustness against attacks. Recently, artificial intelligence-based watermarking has attracted considerable scientific interest. Deep learning is nowadays among the most powerful and the most time- and cost-efficient machine learning approaches. It has brought significant improvements in numerous applied research areas such as computer vision, medicine, natural language processing, object detection, face recognition, handwriting recognition, and speech recognition, and it is one of the fastest-developing methods, with breakthrough performance. Consequently, deep learning models are considered promising for protecting the intellectual property of digital multimedia content. Since 2017, several deep learning-based watermarking techniques have been proposed to embed signatures into media content. However, most of these works focus on image content, where techniques are generally classified based on their network architecture [17]. Although many traditional watermarking algorithms have been proposed for video, 3D models, and audio, deep learning models have yet to focus on these areas. Indeed, to our knowledge, there is only one paper for 3D models [93], two papers for video, and no paper exploring deep learning models for audio watermarking.

As deep learning-based watermarking is a relatively recent area of research, current surveys concentrate on traditional algorithms. Since 2020, a few survey papers have been published on deep learning image watermarking, but no survey paper has focused on deep learning-based video watermarking techniques. Byrnes et al. [17] proposed a comprehensive survey of deep data hiding models, unifying digital watermarking and steganography. Zhang et al. [95] also present a brief survey on deep learning-based data hiding, steganography, and watermarking for images. Li et al. [51] provide an overview of watermarking of deep learning models, and [29] gives a brief survey of image watermarking based on deep neural network architecture.

As deep learning-based watermarking continues to expand, and several works were recently proposed for video, it is important to summarize and compare the current methods proposed for image and video. This survey aims to briefly classify traditional watermarking techniques for image and video and discuss existing deep learning models for image and video watermarking. We present future directions in deep learning-based video watermarking areas that research may take. The key contributions of the survey are as follows:

  • This survey briefly classifies and compares the existing traditional watermarking techniques proposed for image and video.

  • We provide a classification of deep learning-based watermarking techniques based on the network architecture and the embedding domain.

  • We discuss and compare the most popular deep learning-based image and video watermarking techniques to give researchers a clear understanding of the practical challenges of deep learning-based watermarking.

  • We also present some future directions for deep learning-based video watermarking.

The rest of the paper is organized as follows. In Sect. 2, we review the traditional image and video watermarking techniques by classifying them based on the embedding target and the domain used. Section 3 presents a comparison of image watermarking techniques utilizing deep learning methods, focusing on their network architecture. The next section details the small number of existing deep learning-based video watermarking techniques and shows their advantages. Sect. 5 discusses the challenges of deep learning-based video watermarking and gives some suggestions to researchers in this domain. Finally, in Sect. 7, we draw some conclusions and highlight some directions for future work.

2 Traditional Image and Video Watermarking Techniques Review

Watermarking is a branch of data hiding technology that hides information in digital content so that it can be transmitted securely over the network. Information-hiding technology mainly includes steganography, covert communication, and watermarking. Watermarking protects digital content against several security problems, such as illegal data distribution, usage, duplication, manipulation, and storage. Indeed, it embeds a signature into the original content and then tries to detect it after different manipulations have been applied to the marked content. Usually, a robust watermark should also be invisible. Watermarking is an important research area thanks to its use in several media applications, such as copyright protection and owner identification, copy control and fingerprinting, content authentication and integrity verification, broadcast monitoring, indexing, and medical applications.

2.1 Watermarking Terminology

The watermarking process comprises two main stages: signature embedding and signature detection. Embedding is the stage in which a signature containing the author’s information or copyright information is embedded within a hosting multimedia content through a specific embedding method, as shown in Fig. 2. First, the hosting content (an image, a video, or a 3D model) is optionally transformed depending on the chosen embedding target (DWT, DCT, FFT, etc.). Then, the signature is generated by randomly scrambling the watermark information with a secret key to enhance the security of the embedding method. The watermark can also be generated by applying encryption algorithms, as proposed in [14, 19, 53, 69]. The obtained mark is embedded within the selected coefficients, which are then brought back into the original domain to obtain the marked content.

Fig. 2 Embedding stage

The signature detection stage tries to extract the embedded watermark, and it usually follows the same steps as the embedding stage, as shown in Fig. 3. Given a marked media, the same transformation used at embedding is applied, and the detection algorithm is run on the obtained coefficients. The signature detection stage may require knowledge of the original content; in this case, the watermarking algorithm is said to be non-blind. Conversely, if the watermark is recovered without comparing the original media with the marked one, the watermarking algorithm is blind.

Fig. 3 Detection stage

If the signature contains a sequence of N bits that can be read from the marked media, the watermarking algorithm is called multi-bit watermarking. In 0-bit watermarking, by contrast, the detector only tries to decide whether a known signature is present in the given media. Several applications require both types: the detector must first verify the presence of the signature and, if it is present, identify which message is encoded.
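To make the 0-bit case concrete, the following minimal sketch uses an additive key-dependent pattern and a blind correlation detector. This is one common way to build such a detector and is an assumption of this example, not a method taken from the surveyed works; the names `key_pattern`, `embed_zero_bit`, and `detect_zero_bit`, the strength `alpha`, and the threshold are all hypothetical.

```python
import numpy as np

def key_pattern(shape, key):
    """Pseudo-random +/-1 pattern derived from a secret key (illustrative)."""
    rng = np.random.default_rng(key)
    return rng.choice([-1.0, 1.0], size=shape)

def embed_zero_bit(host, key, alpha=4.0):
    """Add the key pattern to the host with strength alpha (no clipping)."""
    return host.astype(np.float64) + alpha * key_pattern(host.shape, key)

def detect_zero_bit(media, key, threshold=2.0):
    """Blind detection: correlate the mean-removed media with the key pattern."""
    pattern = key_pattern(media.shape, key)
    centered = media.astype(np.float64) - media.mean()
    score = float(np.mean(centered * pattern))
    return score > threshold, score

# Usage: the detector only answers "is the known signature present?"
host = np.random.default_rng(1).integers(0, 256, size=(256, 256))
marked = embed_zero_bit(host, key=42)
present, score = detect_zero_bit(marked, key=42)
```

The detector needs only the key, not the original content, so the sketch is blind in the sense defined above. A multi-bit scheme would instead return the decoded message.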

Any watermarking technique must satisfy three main requirements: invisibility, capacity, and robustness. Depending on the application, these requirements are used to evaluate the performance of watermarking systems. In the case of invisible watermarking, the marked and the original content should be perceptually indistinguishable to humans. This fidelity can be evaluated qualitatively, by asking a group of people to assess the visual quality of the marked content, or quantitatively, by calculating several criteria. The standard criteria used to evaluate invisibility quantitatively are the mean peak signal-to-noise ratio (MPSNR) and the mean structural similarity index (MSSIM). When the marked content is an image, the PSNR is calculated as shown in the following equations:

$$\begin{aligned}&\displaystyle PSNR = 10\log _{10} \frac{255^{2}}{MSE} (dB) \end{aligned}$$
(1)
$$\begin{aligned}&\displaystyle MSE = \frac{1}{M \times N} \sum _{m=1}^{M} \sum _{n=1}^{N} [f(m,n) - f_{w}(m,n)]^{2} \end{aligned}$$
(2)

where M \(\times \) N is the size of the image, f and \(f_{w}\) are the original and marked images, and MSE is the mean square error between f and \(f_{w}\). In the case of a video and if the number of marked frames is K, we calculate the Mean PSNR as follows:

$$\begin{aligned} MPSNR = \frac{1}{K}\sum _{k=1}^{K} PSNR_{k} \end{aligned}$$
(3)
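Equations (1) to (3) translate directly into a few lines of NumPy. The sketch below assumes 8-bit content (peak value 255, as in Eq. 1) and frames of identical size; the function names are illustrative.

```python
import numpy as np

def mse(original, marked):
    """Mean square error between two images of identical shape (Eq. 2)."""
    diff = original.astype(np.float64) - marked.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr(original, marked, peak=255.0):
    """PSNR in dB (Eq. 1); returns infinity for identical images."""
    err = mse(original, marked)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / err))

def mpsnr(original_frames, marked_frames):
    """Mean PSNR over the K marked frames of a video (Eq. 3)."""
    values = [psnr(f, fw) for f, fw in zip(original_frames, marked_frames)]
    return float(np.mean(values))
```

Note that the mean in Eq. 3 becomes infinite if any marked frame is identical to its original, so in practice only frames that were actually modified are usually included.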

Despite their simplicity, PSNR and MPSNR do not always agree with subjective evaluation results, so SSIM and MSSIM were introduced to evaluate the visual quality of the marked image or video. The MSSIM is defined as follows:

$$\begin{aligned}&\displaystyle MSSIM = \frac{1}{K}\sum _{k=1}^{K} SSIM(f_{k},f_{kw}) \end{aligned}$$
(4)
$$\begin{aligned}&\displaystyle SSIM(f_{k},f_{kw})=\frac{(2 \mu _{f_{k}} \mu _{f_{kw}} + C_{1}) (2 \sigma _{f_{k}f_{kw}} + C_{2})}{(\mu _{f_{k}}^{2} + \mu _{f_{kw}}^{2} + C_{1}) (\sigma _{f_{k}}^{2} + \sigma _{f_{kw}}^{2} + C_{2})} \end{aligned}$$
(5)

where \(\mu _{f_{k}}\) and \(\mu _{f_{kw}}\) are the mean values of the original image and the marked one, respectively; \(\sigma _{f_{k}}^{2}\) and \(\sigma _{f_{kw}}^{2}\) are the variances of the original image and the marked one; \(\sigma _{f_{k}f_{kw}}\) denotes the covariance of the original image and the marked one; and \(C_{1}\) and \(C_{2}\) are two stability constants. We note that there exist watermarking techniques that are visible, but their use is limited to specific applications.
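As an illustration, the sketch below evaluates Eq. 5 with statistics computed over the whole frame and averages the result over frames as in Eq. 4. Two points are assumptions of this example and not taken from the text: the constants are set to the conventional values \(C_{1} = (0.01 \times 255)^{2}\) and \(C_{2} = (0.03 \times 255)^{2}\), and the statistics are global, whereas practical SSIM implementations usually compute them over local windows and average the resulting map.

```python
import numpy as np

def ssim_global(f, fw, peak=255.0, k1=0.01, k2=0.03):
    """SSIM of Eq. 5 using whole-frame mean, variance and covariance."""
    f = f.astype(np.float64)
    fw = fw.astype(np.float64)
    c1 = (k1 * peak) ** 2  # assumed conventional stability constants
    c2 = (k2 * peak) ** 2
    mu_f, mu_w = f.mean(), fw.mean()
    var_f, var_w = f.var(), fw.var()
    cov = np.mean((f - mu_f) * (fw - mu_w))
    num = (2.0 * mu_f * mu_w + c1) * (2.0 * cov + c2)
    den = (mu_f ** 2 + mu_w ** 2 + c1) * (var_f + var_w + c2)
    return float(num / den)

def mssim(original_frames, marked_frames):
    """Mean SSIM over the K frames of a video (Eq. 4)."""
    values = [ssim_global(f, fw) for f, fw in zip(original_frames, marked_frames)]
    return float(np.mean(values))
```

With identical frames the numerator and denominator coincide and the index equals 1; any distortion that lowers the covariance lowers the score.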

The second requirement is capacity (also called payload), which represents the quantity of information embedded in the host media. For several applications, if the watermarking technique needs high invisibility, it is necessary to reduce the signature capacity to avoid too much modification of the host media.

The last requirement is robustness, which is the ability to extract the embedded signature even when the marked media undergoes several signal processing manipulations. These manipulations include non-malicious attacks, which are unintentional processing operations that may perturb the embedded signature, such as geometric operations (translation, rotation, scaling), noise addition, and filtering, which can be applied to image or video content, and malicious attacks, which try to damage or remove the embedded signature. Among these attacks, we distinguish compression attacks and collusion, which are specific to video content. Note that, depending on the application, not all watermarking techniques are robust against the same manipulations.

Referring to the robustness level, techniques can be classified into robust, fragile, and semi-fragile watermarking. Robust watermarking requires the watermark to resist noisy operations, as well as geometric or non-geometric manipulations. This class of watermarking is used in different applications such as copyright protection, broadcast monitoring, copy control, and fingerprinting. If the embedded signature is lost or altered after any manipulation is applied to the host content, the watermarking is fragile. This class of watermarking is usually used for integrity verification and content authentication applications. The last type of watermarking is the semi-fragile class, which is robust against some attacks but fails after malicious manipulations. This class can be used for image authentication applications.

Bit error rate (BER) and normalized correlation (NC) are used to evaluate the robustness of a given watermarking technique. These two metrics compare the embedded signature with the extracted one after different attacks have been applied to the marked content. The BER gives the proportion of erroneous bits, and it is given by the following equation, where S is the original signature, S’ is the extracted one, \(\sum _{i} Ber_{i}\) is the number of bits in error, and \(\sum _{i} Btrans_{i}\) is the total number of transmitted bits:

$$\begin{aligned} BER(S,S^{'}) = \frac{\sum _{i} Ber_{i}}{\sum _{i} Btrans_{i}} \end{aligned}$$
(6)
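For binary signatures, Eq. 6 reduces to counting mismatched positions. A minimal sketch, with an illustrative function name:

```python
import numpy as np

def bit_error_rate(signature, extracted):
    """Fraction of bits that differ between the two signatures (Eq. 6)."""
    s = np.asarray(signature).ravel()
    e = np.asarray(extracted).ravel()
    if s.shape != e.shape:
        raise ValueError("signatures must have the same number of bits")
    return float(np.count_nonzero(s != e)) / s.size

# Two of the four bits differ, so the BER is 0.5.
example_ber = bit_error_rate([1, 0, 1, 1], [1, 1, 1, 0])
```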

The NC measures the similarity between two signatures. It is a value in the range [0,1], where a higher value indicates a better similarity. Given an original and an extracted signature S and \(S^{'}\), both of size W \(\times \) H, the NC metric is calculated as follows:

$$\begin{aligned} NC(S,S^{'})= \frac{1}{WH} \sum _{i=0}^{W-1} \sum _{j=0}^{H-1} \delta (S_{i,j}, S^{'}_{i,j}) \end{aligned}$$
(7)

where

$$\begin{aligned} \delta (S_{i,j}, S^{'}_{i,j})= {\left\{ \begin{array}{ll} 1, &{} \text {if } S_{i,j}= S^{'}_{i,j} \\ 0, &{} \text {otherwise } \end{array}\right. } \end{aligned}$$
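With this definition, the NC of Eq. 7 is simply the fraction of matching entries in the two W \(\times \) H signatures. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def normalized_correlation(signature, extracted):
    """Fraction of positions where the two signatures agree (Eq. 7)."""
    s = np.asarray(signature)
    e = np.asarray(extracted)
    if s.shape != e.shape:
        raise ValueError("signatures must have the same shape")
    return float(np.mean(s == e))

original = np.array([[1, 0], [1, 1]])
extracted = np.array([[1, 1], [1, 1]])
example_nc = normalized_correlation(original, extracted)  # 3 of 4 entries match
```

For binary signatures, this definition gives NC = 1 - BER, so the two metrics carry the same information under Eq. 7.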

The signature capacity, invisibility, and robustness are mutually constraining. Indeed, the most difficult challenge in image and video watermarking research is how to choose an embedding target that minimizes the visual impact while achieving high robustness and an acceptable capacity in the same technique.

2.2 Robust Traditional Image and Video Watermarking Techniques Classification

The main criterion used to classify image and video watermarking techniques is the embedding domain, which can be spatial, frequency, or hybrid.

Spatial watermarking embeds the signature by directly modifying the luminance or the chrominance of the original image or video frame pixels. Spatial techniques are characterized by their low complexity and high invisibility. However, they suffer from a lack of robustness against several attacks. The main spatial domain techniques proposed for image and video watermarking include least significant bit (LSB) modification, spread spectrum modulation, and so on.

Concerning image content, LSB is the most widely used spatial domain technique: the least significant bit of several selected pixels is modified to embed the signature [58]. LSB is very simple, but it fails to resist several attacks. For this reason, alternative methods have been developed to improve the robustness, such as MIDSB (middle significant bit) [12] and ISB (intermediate significant bit) [62], in which the least significant bit is replaced, respectively, by the middle significant bit and by the best pixel value between the middle and the edge of the range. Other spatial techniques were proposed [74] to improve robustness while keeping the visual quality level. For video, LSB is also the most classical technique: the method used for image watermarking is applied to all or some selected frames composing the original video [42]. Despite the simplicity of the LSB technique, its robustness is very poor. Spread spectrum techniques were proposed as an effective form of spatial watermarking, where the original video frames are scanned to obtain a one-dimensional signal and the signature is modulated by spread spectrum technology and inserted in the video [60]. Other spatial video watermarking techniques were also proposed in [8, 48, 82] to improve robustness against attacks. However, the application of these techniques is limited due to their poor robustness, especially with the development of video coding technology.
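The LSB principle can be sketched in a few lines. This is a generic illustration for an 8-bit grayscale image, not the specific scheme of [58]; selecting the pixel positions with a key-seeded generator is an assumption of this example, and the function names are hypothetical.

```python
import numpy as np

def lsb_embed(image, bits, key):
    """Write each signature bit into the LSB of a key-selected pixel."""
    flat = image.ravel().copy()
    rng = np.random.default_rng(key)
    positions = rng.choice(flat.size, size=len(bits), replace=False)
    flat[positions] = (flat[positions] & 0xFE) | np.asarray(bits, dtype=np.uint8)
    return flat.reshape(image.shape)

def lsb_extract(marked, n_bits, key):
    """Read the LSBs back from the same key-selected pixels."""
    rng = np.random.default_rng(key)
    positions = rng.choice(marked.size, size=n_bits, replace=False)
    return marked.ravel()[positions] & 1

image = np.arange(64, dtype=np.uint8).reshape(8, 8)
bits = [1, 0, 1, 1, 0, 0, 1, 0]
marked = lsb_embed(image, bits, key=123)
recovered = lsb_extract(marked, len(bits), key=123)
```

Each pixel changes by at most one gray level, which explains the high invisibility. The poor robustness is equally easy to see: clearing the least significant bit plane (`marked & 0xFE`), which is visually negligible, resets every extracted bit to zero.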

Frequency domain-based watermarking converts the original content (image or video frames) using a chosen transform and then modifies the obtained coefficients to embed the signature. After that, the coefficients are converted back to the spatial domain to obtain the marked content. The most used frequency domain transforms for image watermarking are the discrete cosine transform (DCT) [38, 54, 75, 80, 83], discrete Fourier transform (DFT) [23, 37, 64, 70], discrete wavelet transform (DWT) [5, 32, 46, 85], and singular value decomposition (SVD) [4, 81]. Every transform presents its own advantages and disadvantages: some transforms are robust against several attacks while they fail against others. For example, the spatial domain usually ensures robustness against translation and noise, but it does not resist compression and filtering, whereas DCT is robust against rotation, filtering, and JPEG compression but fails to resist noise. To resolve this problem, several image watermarking algorithms are based on the hybrid domain, which combines different transforms with the spatial domain to profit from the advantages of each [2, 13, 76]. Note that these algorithms offer the best trade-off between robustness, capacity, and invisibility.
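The transform, modify, and inverse-transform pipeline can be illustrated with a block DCT. The sketch below embeds one bit per 8 \(\times \) 8 block by quantizing a single mid-frequency coefficient so that its quantization index has the parity of the bit. The quantization rule, the coefficient position `pos`, and the step size are assumptions chosen for illustration and do not reproduce any of the cited schemes.

```python
import numpy as np

BLOCK = 8

def dct_matrix(n=BLOCK):
    """Orthonormal DCT-II matrix C, so that C @ B @ C.T is the 2D DCT of B."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def block_slices(index, cols):
    """Row and column slices of the index-th block in raster order."""
    r, col = divmod(index, cols)
    return (slice(r * BLOCK, (r + 1) * BLOCK), slice(col * BLOCK, (col + 1) * BLOCK))

def embed_dct(image, bits, step=16.0, pos=(3, 2)):
    """Embed one bit per block by forcing the parity of a quantized coefficient."""
    c = dct_matrix()
    rows, cols = image.shape[0] // BLOCK, image.shape[1] // BLOCK
    if len(bits) > rows * cols:
        raise ValueError("not enough blocks for the signature")
    marked = image.astype(np.float64).copy()
    for index, bit in enumerate(bits):
        sl = block_slices(index, cols)
        coeffs = c @ marked[sl] @ c.T                 # forward transform
        m = np.round((coeffs[pos] - bit * step) / (2.0 * step))
        coeffs[pos] = step * (2.0 * m + bit)          # nearest multiple with parity = bit
        marked[sl] = c.T @ coeffs @ c                 # back to the spatial domain
    return np.clip(np.round(marked), 0, 255).astype(np.uint8)

def extract_dct(marked, n_bits, step=16.0, pos=(3, 2)):
    """Blind extraction: read the parity of the quantized coefficient."""
    c = dct_matrix()
    cols = marked.shape[1] // BLOCK
    bits = []
    for index in range(n_bits):
        block = marked[block_slices(index, cols)].astype(np.float64)
        coef = (c @ block @ c.T)[pos]
        bits.append(int(np.round(coef / step)) % 2)
    return bits
```

The step size controls the trade-off discussed earlier: a larger step tolerates larger perturbations of the coefficient before a bit flips, at the cost of a larger change to the pixels.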

Concerning video content, as for images, the common frequency domain transforms include DCT [18, 34, 49, 89], DWT [15, 30, 72, 79], and SVD, which is usually combined with another transform such as DWT [33, 77] or DCT [61]. As for images, the robustness of video watermarking techniques depends on the characteristics of the chosen transform. To further improve performance, many watermarking algorithms use the hybrid domain, which combines the advantages of the different transforms. Accordingly, different techniques were proposed that combine DCT and DWT [39, 73] or combine different transforms with the spatial domain, as suggested in [44].

Since video content can be considered as a set of frames, any image watermarking technique can be adopted for video watermarking by embedding the signature into the spatial redundancy of all or some selected frames. However, image-based techniques cannot resist video-specific attacks. In fact, video is also defined by temporal information, which makes its processing more sensitive, and the temporal redundancy in a video gives hackers more chances to estimate signatures by using malicious attacks such as collusion. This last attack and frame-based attacks such as compression, frame dropping, and frame swapping should be considered by researchers when developing watermarking techniques for video. To resist these attacks, different techniques based on temporal information, such as mosaic [45], multi-sprites [11], and Krawtchouk moments [10], have been proposed, and they have shown good robustness against malicious attacks, especially against the collusion attack.
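A toy experiment shows why temporal redundancy helps an attacker. It assumes a perfectly static scene in which each frame carries an independent key-dependent pattern, and it measures the watermark non-blindly (against the known scene) purely for clarity; both choices, as well as the function names, are assumptions of this sketch rather than a model of any cited attack.

```python
import numpy as np

def mark_frames(scene, keys, alpha=4.0):
    """Add an independent +/-1 pattern, scaled by alpha, to each copy of the scene."""
    patterns = [np.random.default_rng(k).choice([-1.0, 1.0], size=scene.shape)
                for k in keys]
    frames = [scene + alpha * p for p in patterns]
    return frames, patterns

def residual_correlation(frame, scene, pattern):
    """Correlation between the watermark residual (frame - scene) and a pattern."""
    return float(np.mean((frame - scene) * pattern))

def average_collusion(frames):
    """Frame-averaging collusion: replace the frames by their temporal mean."""
    return np.mean(frames, axis=0)

scene = np.full((64, 64), 128.0)
frames, patterns = mark_frames(scene, keys=range(8))
before = residual_correlation(frames[0], scene, patterns[0])
after = residual_correlation(average_collusion(frames), scene, patterns[0])
```

Averaging K frames scales each independent pattern by roughly 1/K while leaving the static scene untouched, so `after` is close to `before / 8` here. Conversely, embedding the same pattern in every frame of a moving scene lets an attacker estimate the pattern itself by averaging, which is why temporally aware designs are needed.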

As video data is nowadays frequently used and transmitted over the Internet, compression is usually applied to reduce video size. However, watermarking techniques that operate on the uncompressed video must decode it during the signature embedding and detection stages, and the subsequent re-encoding can destroy the signature and deteriorate the visual quality. To resolve this problem, a new class of video watermarking algorithms has emerged in which the compressed domain is used. These algorithms embed signatures into compressed videos and combine the embedding stage with the corresponding video coding standards, which include MPEG [21, 22, 90], H.264 [27, 98], and H.265 [24, 55, 71]. Compressed domain-based watermarking is robust against several attacks such as filtering, noise, and compression.

In summary, the classification of the traditional robust image and video watermarking techniques is illustrated in Fig. 4.

Fig. 4 Classification of the traditional robust image and video watermarking techniques

3 Basic Concepts of Deep Learning-Based Watermarking

With the success of deep learning in computer vision and image processing domains, it has been adopted for various tasks. Recently, deep learning models have attracted the attention of researchers in data hiding techniques including steganography [9, 25] and watermarking.

3.1 General Framework of Deep Learning-Based Watermarking Schemes

Deep learning-based watermarking usually uses an encoder–decoder structure based on convolutional neural networks (CNNs), which is trained to embed the signature in a robust and invisible way. It is more efficient than traditional watermarking thanks to its ability to be retrained to resist several attacks. In addition, it does not need an expert to design the embedding method. Finally, the black-box nature of deep learning models helps to improve security.

The deep learning-based watermarking scheme is composed of three main stages, as shown in Fig. 5. The first stage is the encoder, which embeds the signature in the original content. The second stage is attack simulation, and the third is signature extraction by the decoder network. Thanks to the iterative learning process, the embedding becomes more robust against the attacks applied during the second stage, and the extraction network improves the integrity of the extracted signature. The main advantage of deep learning-based watermarking over traditional watermarking is that it can be easily retrained for various applications and different attacks instead of being designed from scratch.

Fig. 5 Encoder–decoder architecture stages for digital watermarking

An image or video watermarking scheme based on deep learning works as follows:

  1. Training the encoder network to embed input messages to original content where the main goal is to minimize an objective function. This function calculates both the difference between original content and marked content and between the embedded and extracted signatures.

  2. Applying different attacks to the marked content through distortion layers. These attacks can include different forms of manipulations such as cropping and compression.

  3. Extracting the embedded message from distorted content using the decoder network.
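The objective function of the first step can be written as a weighted sum of an image term and a message term. The sketch below is only an illustration: the choice of mean square error for the image term, binary cross-entropy for the message term, and the balancing parameter `weight` are assumptions of this example, since individual schemes define their own losses.

```python
import numpy as np

def image_loss(cover, marked):
    """Mean square error between the original and the marked content."""
    cover = np.asarray(cover, dtype=np.float64)
    marked = np.asarray(marked, dtype=np.float64)
    return float(np.mean((cover - marked) ** 2))

def message_loss(message, decoded_prob, eps=1e-7):
    """Binary cross-entropy between embedded bits and decoder probabilities."""
    m = np.asarray(message, dtype=np.float64)
    p = np.clip(np.asarray(decoded_prob, dtype=np.float64), eps, 1.0 - eps)
    return float(-np.mean(m * np.log(p) + (1.0 - m) * np.log(1.0 - p)))

def watermark_objective(cover, marked, message, decoded_prob, weight=1.0):
    """Invisibility term plus weighted extraction term, minimized during training."""
    return image_loss(cover, marked) + weight * message_loss(message, decoded_prob)

cover = np.zeros((2, 2))
marked = cover + 0.5
loss = watermark_objective(cover, marked, [1, 0], [0.9, 0.1], weight=2.0)
```

In a full training loop, `decoded_prob` would be the decoder output computed on the distorted content of step 2, so that minimizing the objective rewards embeddings that survive the simulated attacks. Increasing `weight` favors robustness over invisibility.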

3.2 Neural Network Architectures Used in Watermarking

Deep learning frameworks utilize automatic learning to capture hierarchical information directly from training data, eliminating the need for manual feature representations. Specifically, a deep network takes raw input data, such as an image or audio signal, and performs a mapping operation. Due to their impressive capability to imitate human brain learning abilities and engage in more natural interactions, deep learning techniques have gained widespread usage in data hiding and image processing applications.

Two deep learning models are widely used in watermarking techniques: convolutional neural network (CNN) and generative adversarial network (GAN).

CNNs are well suited for different applications such as classification and recognition, thanks to their efficiency in data representation with a limited number of parameters [50]. The CNN is a specialized multilayer perceptron primarily developed for extracting and recognizing two-dimensional image details. The CNN architecture typically consists of multiple layers, including an input layer, convolutional layers, pooling layers, and an output layer, as shown in Fig. 6. The CNN takes an input image and subjects it to a series of convolution and subsampling operations. Each convolutional layer comprises a collection of filter matrices, which are convolved with the preceding feature maps to extract significant features referred to as output channel maps. Subsequently, pooling layers are employed to decrease the dimensions of the input map while preserving crucial information. Max pooling, a subsampling technique, selects the maximum value within each block. Nonlinearity is introduced into the network through activation functions like the rectified linear unit (ReLU), which sets negative values to zero. To mitigate overfitting and expedite learning, batch normalization can be employed during network training.

Fig. 6 CNN architecture
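The three basic operations described above (convolution, ReLU activation, and max pooling) can be written for a single-channel input in plain NumPy. This is a didactic sketch with illustrative function names; deep learning libraries provide optimized, multi-channel, trainable versions of the same operations.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the filter over the image without padding (cross-correlation,
    which is what deep learning libraries call convolution)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Rectified linear unit: negative values are set to zero."""
    return np.maximum(x, 0.0)

def max_pool2d(x, size=2):
    """Keep the maximum of each non-overlapping size x size block."""
    h, w = x.shape[0] // size * size, x.shape[1] // size * size
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

image = np.arange(25, dtype=np.float64).reshape(5, 5)
kernel = np.array([[-1.0, 0.0], [0.0, 1.0]])  # responds to diagonal increase
features = max_pool2d(relu(conv2d_valid(image, kernel)))
```

For this input, every filter response equals 6 (the difference between a pixel and its lower-right neighbor), so the 4 \(\times \) 4 response map is pooled down to a 2 \(\times \) 2 map of sixes.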

Concerning GAN, it is a type of neural network widely employed in unsupervised learning. A GAN consists of two neural network models, a generative model and a discriminative model, that engage in a competitive process, enabling them to examine, grasp, and replicate the diverse patterns present in a given dataset. It follows the same principle as the encoder–decoder described in Fig. 5, with the addition of a discriminator network which classifies the mixture of encoded and unaltered images that are given to it (Fig. 7). The use of such discriminative networks can greatly improve data imperceptibility.

Fig. 7 GAN architecture
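The competition between the two networks can be expressed through two loss terms. In the sketch below, the discriminator outputs the probability that an image is marked; the label convention (0 for unaltered, 1 for encoded) and the function names are assumptions of this example.

```python
import numpy as np

def bce(prob, target, eps=1e-7):
    """Mean binary cross-entropy between probabilities and 0/1 targets."""
    p = np.clip(np.asarray(prob, dtype=np.float64), eps, 1.0 - eps)
    t = np.asarray(target, dtype=np.float64)
    return float(-np.mean(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)))

def discriminator_loss(p_cover, p_marked):
    """The discriminator should output 0 for unaltered and 1 for encoded images."""
    p_cover = np.asarray(p_cover, dtype=np.float64)
    p_marked = np.asarray(p_marked, dtype=np.float64)
    return bce(p_cover, np.zeros_like(p_cover)) + bce(p_marked, np.ones_like(p_marked))

def encoder_adversarial_loss(p_marked):
    """The encoder is rewarded when encoded images are classified as unaltered."""
    p_marked = np.asarray(p_marked, dtype=np.float64)
    return bce(p_marked, np.zeros_like(p_marked))

# A discriminator that spots the marked image: low loss for it, high for the encoder.
d_loss = discriminator_loss([0.1], [0.9])
e_loss = encoder_adversarial_loss([0.9])
```

Adding the encoder term to the watermarking objective pushes the encoder toward marked images that the discriminator cannot tell apart from unaltered ones, which is how the adversarial component improves imperceptibility.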

3.3 Examples of Datasets for Watermarking

To assess the performances of a deep learning-based watermarking scheme, different datasets were used in the literature. Among these datasets, we mention:

  • ImageNet: ImageNet is a widely used dataset in computer vision research, consisting of millions of labeled images across thousands of categories. While not specifically designed for watermarking, it can be used to evaluate the effectiveness of watermarking techniques on various types of images.

  • MS COCO (Microsoft Common Objects in Context): MS COCO is another popular dataset used for object detection and image segmentation tasks. It contains a large collection of images with diverse content, making it suitable for watermarking research.

  • BOSSbase (BOWS-2): BOSSbase is a benchmark dataset for digital image watermarking. It contains 10,000 grayscale images with a resolution of 512x512 pixels. The dataset includes both the original images and the corresponding watermarked versions, making it suitable for evaluating the robustness and imperceptibility of watermarking algorithms.

  • UCF101: UCF101 is a dataset commonly used for action recognition in videos. It consists of 13,320 videos covering 101 action categories. While primarily used for action recognition, it can be employed to evaluate video watermarking techniques on action-based content.

  • Kinetics: The Kinetics dataset is a large-scale video dataset commonly used for action recognition tasks. It consists of approximately 650,000 video clips covering 700 action categories. The dataset is diverse and includes a wide range of human actions captured from YouTube videos. While the Kinetics dataset is not designed specifically for watermarking research, it can still be useful for evaluating certain aspects of watermarking techniques on action-based video content.

4 Deep Learning-Based Image Watermarking Review

While current research on deep learning-based watermarking predominantly revolves around image watermarking, its application to other forms of media is still at an early stage of development. Only a limited number of works have been proposed for text [1] and 3D images [92]. These approaches offer improved efficiency compared to traditional techniques by leveraging their ability to learn complex insertion patterns that are resilient against various attacks. This robustness is obtained because deep learning networks can be easily retrained to become robust to different types of attacks. Moreover, they can target capacity or imperceptibility optimization without a new algorithm having to be developed for each application. Deep learning models are characterized by their high nonlinearity, which makes the retrieval of the embedded signature by an adversary very difficult.

4.1 Classification of Deep Learning-Based Image Watermarking Schemes

Current deep learning-based image watermarking techniques can be categorized into two classes based on the chosen network architecture. The first class uses the encoder–decoder framework built on CNNs, where we distinguish techniques based on a CNN encoder–decoder (Fig. 5) from those based on convolutional auto-encoders, which are a special case of the encoder–decoder used in unsupervised learning scenarios.

Two traditional convolutional auto-encoders for watermark embedding and extraction were proposed in [41]. These auto-encoder CNN models achieve high invisibility of the embedded signature. Moreover, the watermarking proposed in [41] proved efficient in terms of robustness and outperformed the traditional watermarking techniques. Another convolutional auto-encoder-based robust and blind watermarking technique was proposed in [63]. This approach is composed of three steps: embedding, attack simulation, and updating. In the second step, the CNN simulates the various attacks, while in the updating step, the loss function is minimized by updating the model’s weights.

In [78], the authors present a method of watermarking digital images using CNNs. First, an encoder network is used to extract latent features from the cover and secret images. These features are then concatenated to create a marked image. On the receiving end, a CNN is used to retrieve the secret marked image after removing noise variations from the received image using a denoising auto-encoder network.

Ahmadi et al. [3] present a new approach called ReDMark, which uses two fully convolutional neural networks (FCNNs) for embedding and extraction. It contains a differentiable attack layer to simulate different distortions. This technique improves robustness against attacks and optimizes the trade-off between robustness and imperceptibility. Zhong et al. [99] propose a CNN-based watermarking technique which is robust and blind and can be used for several applications. This technique generalizes the watermarking process by training a deep neural network to learn the general rules of watermark embedding and extraction. It outperforms the two auto-encoder CNN methods proposed in [41, 63] and achieves greater robustness. Another watermarking model, developed in [47], uses a simple CNN for both embedding and extraction. It contains an image pre-processing network that can adapt images of any resolution for the watermarking process, a watermark pre-processing network, and a strength scaling factor to control the trade-off between robustness and imperceptibility.

Luo et al. [57] improve the CNN-based encoder–decoder framework by adopting trained CNNs for attack simulation instead of using a differentiable attack layer. The addition of adversarial components to model training can improve the robustness of the embedded mark. In fact, in [57], the distortions are generated via adversarial training by a trained CNN.

In [68], an optimized deep fusion convolutional neural network (FCNN)-based digital color image watermarking scheme was proposed for copyright protection. It suggests a deep fusion CNN that uses an optimization method as its basis. The octave convolutional module added by the embedding network reduces spatial redundancy and increases the receptive field. The ECO method can help choose a suitable strength factor with great exploration capabilities.

The second class of deep learning-based image watermarking is based on generative adversarial networks (GAN) [28]. Several variants of the GAN exist, including Wasserstein GANs (WGANs) and CycleGANs, which are used for image watermarking and provide good results in terms of invisibility and robustness. HiDDeN [100] is the first scheme to use an adversarial discriminator to improve the performance of the watermarking process. It is composed of an encoder network which is trained to embed an encoded bit string, a decoder network which tries to extract the information from the encoded image, and an adversary network which predicts whether the image was encoded or not.

ROMark [87] and [31] improve the HiDDeN technique. The goal of [87] is to minimize the decoding loss across a range of attacks, rather than training the model to resist specialized attacks. This technique is more robust than [100] in some specialized attack categories. The method in [31] uses a rotation layer and an additive noise layer, allowing the model to learn robustness against geometric rotation attacks. It also uses a noise strength factor to balance robustness against invisibility. Zhang et al. [96] proposed a GAN-based watermarking technique that uses inverse gradient attention (IGA) to embed the signature. This technique identifies robust pixels using an attention mask that provides the gradient values of the original image, which improves the capacity and robustness of the marked images compared with other techniques. Another GAN-based watermarking technique was proposed in [52], where a two-stage separable deep learning (TSDL) framework was introduced. This framework can use true non-differentiable noise attacks, such as JPEG compression, during training. Liu et al. [52] achieve good robustness compared with the previous techniques.

Annadurai et al. [6] present a digital watermarking approach based on a discrete wavelet transform (DWT) quantization model combined with convolutional generative adversarial neural networks, which are used for segmentation and classification of the processed image. In this technique, the SVD-based discrete wavelet transform quantization model is used for watermarking.

Other watermarking techniques using GAN variants have been proposed. The first variant is the Wasserstein GAN (WGAN), which improves the stability of training and lessens the sensitivity of GAN training [7]. Instead of a discriminator, WGANs contain a critic component, which returns a score indicating how real the input image appears.
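The difference between a critic and a discriminator can be made concrete with a minimal sketch of the standard Wasserstein losses, computed here on hypothetical critic scores. The score values and the mapping of "real" to cover images and "generated" to marked images are assumptions made for illustration; the Lipschitz constraint that a critic needs in practice (weight clipping in the original WGAN formulation) is omitted.

```python
def critic_loss(scores_real, scores_generated):
    """Wasserstein critic loss: mean score of generated inputs minus mean score of real inputs.

    Minimizing it pushes real scores up and generated scores down.
    """
    mean_real = sum(scores_real) / len(scores_real)
    mean_generated = sum(scores_generated) / len(scores_generated)
    return mean_generated - mean_real


def generator_loss(scores_generated):
    """The generator (here, the watermark encoder) tries to raise the critic score of its outputs."""
    return -sum(scores_generated) / len(scores_generated)


scores_cover = [1.0, 0.5, 0.75, 0.75]      # hypothetical critic scores for cover images
scores_marked = [-0.5, 0.0, 0.25, -0.75]   # hypothetical critic scores for marked images

print(critic_loss(scores_cover, scores_marked))   # -1.0
print(generator_loss(scores_marked))              # 0.25
```

Unlike a discriminator, the critic outputs an unbounded score rather than a probability, so no sigmoid or logarithm appears in either loss.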

Plata et al. [66] proposed a new WGAN-based watermarking technique in which the signature is spread over the spatial domain of the image. The technique uses a new method for differentiable approximation of non-differentiable distortions, which allows the simulation of subsampling attacks. In [67], the authors improve the previous work by using a double discriminator/detector architecture. The discriminator is placed after the noise layer and learns to distinguish watermarked from non-watermarked images with attacks already applied. Wang et al. [86] proposed a technique that enhances the quality of the encoded image based on texture analysis. The texture of the original image is analyzed using a gray-level co-occurrence matrix, which classifies regions into complex and flat types.
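The texture analysis step can be sketched as follows, assuming a single horizontal pixel offset, the contrast statistic as the texture measure, and a hypothetical threshold. These choices, together with the function names and toy blocks, are illustrative assumptions and are not taken from [86].

```python
def glcm(block, levels, dx=1, dy=0):
    """Normalized gray-level co-occurrence matrix for one pixel offset (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(block), len(block[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[block[r][c]][block[r2][c2]] += 1
                total += 1
    return [[n / total for n in row] for row in counts]


def contrast(p):
    """GLCM contrast: large when neighbouring pixels differ strongly in gray level."""
    size = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(size) for j in range(size))


def classify_block(block, levels, threshold=1.0):
    """Label a block 'complex' or 'flat'; the threshold value is hypothetical."""
    return "complex" if contrast(glcm(block, levels)) > threshold else "flat"


flat_block = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]        # uniform region, 4 gray levels
textured_block = [[0, 3, 0], [3, 0, 3], [0, 3, 0]]    # checkerboard-like region

print(classify_block(flat_block, levels=4))       # flat
print(classify_block(textured_block, levels=4))   # complex
```

A scheme of this kind could then embed more strongly in complex regions, where changes are less visible, and more weakly in flat ones.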

The second GAN variant used for image watermarking is the CycleGAN [101], which includes two generative and two discriminative models. The technique in [94] is the only watermarking technique that uses this framework. It uses an attention model to embed data, producing an attention mask that represents the attention sensitivity of each pixel in the cover image. This enhances the embedding process of the encoder network.

4.2 Comparison of Deep Learning-Based Image Watermarking Schemes

Figure 8 illustrates the classification of image watermarking using deep learning, depicting the fluctuation in the number of proposed papers according to the employed architecture. Table 1 summarizes the differences between the various deep learning-based image watermarking techniques proposed in the literature. Based on the state of the art, we observe that GANs are more efficient and promising in terms of robustness, thanks to the inclusion of an adversarial network, which greatly improves the invisibility of the watermark. However, to improve the robustness of GAN-based watermarking, it should be combined with other methods, as proposed in [94] and [96], where an attention mechanism and the IGA method were used. Note that the robustness depends on the attack types used during training. Finally, we note that every class is robust against a set of attacks and that all techniques except [41] are blind.

Fig. 8
figure 8

Number of deep learning-based image watermarking articles per Architecture

Regarding the invisibility criterion, Fig. 9 compares the PSNR values obtained with various existing deep learning-based image watermarking techniques. From this figure, we observe that most deep learning-based image watermarking techniques provide high visual quality for the watermarked image. Nevertheless, CNN-based techniques offer the best PSNR values, which is attributed to their capacity to directly capture the intricate mapping between low-resolution and high-resolution images. This enhances the recovery of lost high-frequency information, surpassing the performance of numerous conventional methods.
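For reference, the PSNR between an original and a marked image is 10 log10(MAX^2 / MSE), where MAX is the largest possible pixel value and MSE is the mean squared error between the two images. A minimal sketch, assuming 8-bit pixels stored as flat lists (the function name and sample values are illustrative):

```python
import math


def psnr(original, marked, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((o - m) ** 2 for o, m in zip(original, marked)) / len(original)
    if mse == 0:
        return math.inf  # identical images: no distortion at all
    return 10.0 * math.log10(max_value ** 2 / mse)


original = [10, 20, 30, 40]   # hypothetical pixel values
marked = [11, 21, 31, 41]     # every pixel changed by one gray level (MSE = 1)

print(round(psnr(original, marked), 2))
```

Higher values mean that the marked image is closer to the original, so a technique with a higher PSNR embeds its watermark less visibly.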

Table 1 Comparison of the existing deep learning-based image watermarking techniques

5 Deep Learning-Based Video Watermarking Review

Despite the large number of deep learning-based watermarking techniques proposed for images, video watermarking based on deep learning has only recently begun to be explored and is still an open problem. In fact, as far as we know, very few deep learning-based video watermarking techniques exist in the literature [16, 26, 35, 40, 43, 56, 59, 65, 91, 97], all of which have appeared since 2019.

5.1 Classification of Deep Learning-Based Video Watermarking Techniques

Deep learning-based video watermarking techniques can be classified according to the domain in which they operate, which can be the original frames or the compressed domain. Most of the techniques in [16, 26, 35, 40, 43, 56, 59, 65, 91, 97] operate on video frames: [16, 35, 43, 56, 65, 91, 97] are proposed for robust multi-bit embedding, and [26] is a robust zero-watermarking scheme. In contrast, [40] was recently proposed for compressed videos. Finally, Mansour et al. [59] rely on the mosaic image generated from the original video (Fig. 9).

Fig. 9
figure 9

PSNR comparison of the deep learning-based image watermarking existing techniques

Zhang et al. [97] introduce a new architecture for robust video watermarking, called RivaGAN, which includes two adversaries: a critic network and an adversary network. The first evaluates the quality of the marked video, and the second tries to remove the watermark. These two components work with the encoder and decoder networks, which respectively embed the watermark into the video and extract it. The proposed architecture relies on an attention-based mechanism that identifies regions that are robust for embedding and generates marked regions with high visual quality. The attention module is composed of two convolutional layers shared between the encoder and decoder. It generates an attention mask from the original frames by applying the two convolutional blocks. This mask contains the data, time, and size dimensions. This mechanism makes training easier and enhances robustness against different attacks such as scaling and compression.

Luo et al. [56] propose DVMark, another multi-bit robust video watermarking technique based on deep learning. It is composed of four components: an encoder, a decoder, a distortion layer, and a GAN discriminator. The encoder is a multiscale network that embeds the signature, which is repeated across the spatial and temporal dimensions, into the original video at two different spatial–temporal scales. A scalar factor is used to change the signature strength at inference time. The decoder network is composed of a transform layer and two detector heads, which can distinguish marked frames from unmarked ones. In the distortion layer, different distortions such as frame dropping, cropping, and compression are applied to the marked video, and the decoder network generates a predicted message from the distorted video. Finally, the architecture of the multiscale video discriminator, with its 3D ResBlock, allows the network to detect both spatial and temporal differences between the cover and watermarked videos. This approach was compared with a traditional 3D-DWT video watermarking method and with the HiDDeN method, and it was found to be more robust against attacks.
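Repeating the signature across the spatial and temporal dimensions can be pictured with a minimal sketch in which each message bit is expanded into a constant volume of size frames x height x width. The layout (one volume per bit), the function name, and the toy sizes are assumptions made for illustration; how DVMark combines such volumes with video features is not reproduced here.

```python
def tile_message(bits, frames, height, width):
    """Repeat each message bit over a frames x height x width volume (one volume per bit)."""
    return [
        [[[float(b)] * width for _ in range(height)] for _ in range(frames)]
        for b in bits
    ]


message = [1, 0, 1]   # hypothetical 3-bit signature
volume = tile_message(message, frames=2, height=2, width=3)

# Shape: (bits, frames, height, width)
print(len(volume), len(volume[0]), len(volume[0][0]), len(volume[0][0][0]))  # 3 2 2 3
```

Because every spatial and temporal position carries the full message, an encoder built on such a volume can spread each bit over the whole clip, which helps when frames are dropped or regions are cropped.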

Bistron and Piotrowski introduce in [16] a video watermarking algorithm that merges CNNs with an entropy-driven information mapper. Their approach involves incorporating the watermark into the YUV color space’s luminance channel. By utilizing an information mapper, intricate multi-bit binary signatures can be embedded into the watermark of a signal frame. Although the article acknowledges the utilization of CNNs and an entropy-based information mapper to enhance resilience, it overlooks the algorithm’s effectiveness when faced with advanced watermarking attacks like geometric transformations, compression, cropping, and collusion.

In [43], the authors aim to create a video watermarking system using curriculum learning approaches and deep neural networks. The attention module is a part of the encoder and decoder component of RivaGAN. Overall, an encoder network that is hidden from the decoder network interrupts the segmented videos’ first frame.

The suggested method in [65] uses an Improved Invasive Honey Badger Optimization (IIHBO) algorithm to embed hidden audio components into videos. The process consists of two primary stages: extraction and embedding. Using a Shepard convolutional neural network (ShCNN) trained by IIHBO, the secret audio is incorporated into the predicted object position during the embedding phase. Using the same methods in reverse, the extraction step extracts the hidden audio from the embedded video. For effective ShCNN training, the IIHBO—a hybrid of Improved Invasive Weed Optimization (IIWO) and Honey Badger Optimization (HBO)—is employed.

Reversible medical video watermarking with a Deep CNN based on SCBSA is discussed in [35]. In order to integrate videos, the approach combines the Sine Cosine and Bird Swarm Algorithm (SCBSA), which includes key frame extraction, region identification, and embedding. From gridded video frames, features like CNN, LOOP, neighborhood-based, and histogram features are retrieved. The appropriate area for embedding a hidden message is determined by the Deep CNN classifier, which has been trained using SCBSA. Using a two-level decomposition based on wavelet transform, the SCBSA, a hybrid technique of SCA and BSA, makes message embedding and extraction simpler.

ItoV [91] presents an approach that adapts image watermarking techniques based on deep learning to video watermarking. The main goal is to address issues like computational complexity and temporal interdependence that are present in video data. The authors concentrate on combining the channel and temporal dimensions of videos so that deep neural networks can process videos like images. They investigate how different convolutional blocks affect video watermarking and find that, although depthwise convolutions greatly lower computational costs with no effect on performance, spatial convolution is essential. In watermark embedding, the neural network’s task is to understand the cover video’s pixel distribution so that messages can be added with the least amount of distortion.

Gao et al. [26] proposed a robust zero-watermarking technique for copyright protection of video content. This technique is based on a CNN architecture with a self-organizing map (SOM) in polar complex exponential transform (PCET) space. First, the scheme extracts features from the frames of the original video using a CNN. Then, it selects significant frames by applying SOM clustering and maximum entropy. Given the selected frames, the invariant moments are computed using PCET, and their dimensionality is reduced by singular value decomposition (SVD). The obtained moments are used to generate a binary matrix. Finally, the zero-watermark signal is generated by applying a bitwise exclusive-OR operation between the binary matrix and the watermark, which is encrypted by a chaotic map. The experiments showed that this technique is robust against several attacks, such as geometric, compression, and inter-frame attacks, and that it outperforms existing video zero-watermarking and traditional video watermarking methods.
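The final XOR step of a zero-watermarking scheme can be sketched as follows. The sketch assumes that the chaotic map is a logistic map binarized at 0.5, that the watermark is encrypted before being combined with the feature bits, and that the feature bits are already available; the map, its parameters, the ordering of the two XOR operations, and all names are illustrative assumptions rather than details confirmed by [26].

```python
def logistic_keystream(length, x0=0.37, mu=3.99):
    """Binary key stream from the logistic map x <- mu * x * (1 - x); x0 and mu act as the key."""
    bits, x = [], x0
    for _ in range(length):
        x = mu * x * (1.0 - x)
        bits.append(1 if x >= 0.5 else 0)
    return bits


def xor_bits(a, b):
    """Element-wise exclusive-OR of two equally long bit lists."""
    return [p ^ q for p, q in zip(a, b)]


def make_zero_watermark(feature_bits, watermark_bits, x0=0.37, mu=3.99):
    """Encrypt the watermark with the key stream, then XOR it with the video's feature bits."""
    encrypted = xor_bits(watermark_bits, logistic_keystream(len(watermark_bits), x0, mu))
    return xor_bits(feature_bits, encrypted)


def recover_watermark(feature_bits, zero_watermark, x0=0.37, mu=3.99):
    """Undo both XOR steps using features extracted from the (possibly attacked) video."""
    encrypted = xor_bits(feature_bits, zero_watermark)
    return xor_bits(encrypted, logistic_keystream(len(zero_watermark), x0, mu))


features = [1, 0, 0, 1, 1, 0, 1, 0]    # hypothetical binary matrix, flattened
watermark = [0, 1, 1, 0, 1, 1, 0, 0]   # hypothetical watermark bits

zero_wm = make_zero_watermark(features, watermark)
print(recover_watermark(features, zero_wm) == watermark)  # True
```

The video itself is never modified: only the zero-watermark is stored, and verification succeeds to the extent that the features extracted from a suspect video match the original ones.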

The deep learning-based video watermarking technique proposed in [40] is a recent method that works in the compressed domain to protect videos encoded with the H.265/HEVC codec. First, the encoder subsystem generates the watermark by applying the adjustable subsquares properties. This watermark is fed to the preliminary network. The encoder DNN takes as input the hidden image together with the original one and decomposes the secret image from the preliminary network into a set of features in order to encode the watermark. Then, the deconvolution of the obtained set of secret image features is carried out with the carrier image. During the learning process, the neural network automatically selects the optimal filters with which the image modification can be carried out. The encoding is done with the adjustable subsquares properties algorithm to obtain a bit-encoded image. After the image has passed through the compression channel of the HEVC codec, it is decoded by a decoder network, which returns the recovered watermark. The decoder subsystem then processes this output to identify the watermark by recognizing the information encoded in the recovered image. This technique yields high visual quality for the marked video.

Recently, a novel approach to video watermarking utilizing deep learning and employing a mosaic image has been presented in [59]. This method extends the concept of image watermarking to video watermarking. It involves four key steps: pre-processing networks for generating the mosaic image from the original video and for handling the signature, an embedding network, attack simulation, and an extraction network. The primary objective of generating the mosaic image is to construct an image from the original video while ensuring resilience against malicious attacks, particularly collusion attacks. The proposed technique adjusts the resolution of the mosaic image and incorporates signature information, incorporating various CNN training methods like averaging, batch normalization, and rectified linear unit (ReLU). During the attack simulation phase, all attacks, except collusion and MPEG compression, are included in each mini-batch.

5.2 Comparison of Deep Learning-Based Video Watermarking Techniques

Figure 10 illustrates the distribution of existing works based on the type of architecture used. It shows that the CNN architecture is the most commonly employed for video watermarking. The widespread adoption of the CNN architecture for video watermarking can be attributed to several reasons. First, CNNs are designed to automatically extract relevant features from input data, which is crucial for video watermarking, where specific patterns need to be identified and incorporated imperceptibly. Second, videos contain complex and dynamic information, and CNNs, with their deep architecture, can capture complex relationships between successive frames, which is essential for video watermarking. Finally, CNNs are designed to be invariant to translations and deformations, making them robust to minor modifications in an image, which is an important property for video watermarking, where the video may undergo alterations.

Fig. 10
figure 10

Distribution of deep learning-based video watermarking articles based on architecture

Table 2 summarizes the advantages of and the differences between the existing deep learning-based video watermarking techniques. These techniques use different network architectures and different embedding domains. Their robustness depends on the domain used and the architecture chosen. The specific procedure of a deep learning-based video watermarking scheme may differ based on the employed technique, but overall, it entails training a deep neural network to learn the video's characteristics along with the watermark. Subsequently, this trained network is used to insert the watermark into the video by modifying the video frames in a way that is imperceptible to the human eye. Likewise, the same neural network can extract the watermark from the watermarked video. One significant advantage of deep learning-based video watermarking methods is their ability to address video watermarking challenges, such as motion and compression artifacts, more effectively than conventional approaches.

Table 2 Comparison of existing deep learning-based video watermarking techniques

Although numerous papers referenced in this study employ various embedding techniques, it is not possible to definitively determine the optimal solution for the video watermarking process. Notably, certain research indicates that treating video frames as images and employing conventional digital image watermarking methods for embedding may be an option. However, this approach hinders the neural network’s ability to acquire valuable video-specific features, thereby posing challenges in effectively countering distortions specifically aimed at videos.

Figure 11 compares the visual quality of existing video watermarking techniques based on deep learning, relying on PSNR values. We can observe that these values are close and range between approximately 34 and 44 dB. These values demonstrate the invisibility guaranteed by these watermarking techniques.

Fig. 11
figure 11

Invisibility comparison of the deep learning-based video watermarking existing techniques

In summary, deep learning-based approaches for video watermarking surpass conventional methods in multiple aspects. These benefits comprise the watermark’s substantial capacity and imperceptibility, resilience against different attacks, versatility across diverse video formats and resolutions, and proficiency in addressing challenges associated with video watermarking, such as motion and compression artifacts.

6 Discussion and Suggestions for Future Research

Deep learning for watermarking is a recent and evolving research domain. As shown in this survey, existing works have mostly focused on image watermarking, but there are many other important applications for watermarking video with deep learning. As far as we know, only the deep learning-based video watermarking techniques described in this survey have been proposed, despite the large number of traditional watermarking techniques proposed for video. In fact, video content continues to present additional challenges, such as temporal coherence, which is an issue that cannot be resolved with still images. Moreover, video compression is not differentiable, and it is difficult to integrate it into a deep neural network training framework. Furthermore, it is not easy to envisage a robust model that exploits temporal correlations in a video while maintaining temporal coherence and perceptual quality. Thus, for videos, deep learning-based watermarking is still in its early stages. However, given the urgent need to protect videos and the invisibility and robustness that deep learning networks can offer, we expect that the above challenges will be the focus of extensive research in the coming years.
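The non-differentiability of compression comes largely from quantization: rounding has a zero gradient almost everywhere, so no training signal can flow through it. The following minimal sketch contrasts hard rounding with a cubic surrogate of the kind used in differentiable approximations of JPEG-style quantization. It is an illustration under that assumption, with hypothetical function names, and is not the approximation used by any specific technique surveyed here.

```python
def soft_round(x):
    """Cubic surrogate for rounding: round(x) + (x - round(x)) ** 3."""
    r = round(x)
    return r + (x - r) ** 3


def soft_round_grad(x):
    """Analytic derivative of the surrogate away from half-integers: 3 * (x - round(x)) ** 2."""
    return 3.0 * (x - round(x)) ** 2


def numerical_grad(f, x, eps=1e-6):
    """Central finite-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


x = 2.3  # hypothetical transform coefficient before quantization

print(numerical_grad(round, x))        # 0.0: hard rounding passes no gradient
print(numerical_grad(soft_round, x))   # close to 0.27
print(soft_round_grad(x))              # close to 0.27
```

The surrogate stays within 0.125 of the true rounded value while providing a usable gradient, which is what allows a quantization-like distortion to sit between an encoder and a decoder during end-to-end training.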

Based on the state of the art presented in this paper, we note that CNN and GAN are the most used architectures for image and video watermarking. These two architectures present different challenges. In fact, the difficulties encountered by CNN models in watermarking systems are outlined below:

  • CNN models experience increased latency due to operations like max-pool.

  • Longer training times can be incurred occasionally due to misconfigured network parameters.

  • Larger datasets are necessary for the training and processing of CNN models.

  • The complexity of CNN networks can lead to issues like overfitting or underfitting at times.

  • Applying a CNN model to video watermarking, which involves the processing of multiple frames and temporal dependencies, can be time-consuming and resource-intensive.

Concerning the GAN model, the challenges are as follows:

  • Overfitting arises from discrepancies between the generator and discriminator networks.

  • The network parameters’ oscillation and destabilization prevent convergence.

  • In certain cases, the discriminator becomes overly adept, leading to the vanishing of the generator gradient and a lack of learning.

  • The generator network occasionally gets stuck, resulting in limited variations of the samples.

However, many other efficient deep learning architectures have been developed for other applications, such as classification and recognition, and we recommend exploring them in watermarking schemes. For example, recurrent neural networks (RNNs) have been used for many video tasks and could provide good results for video watermarking.

Additionally, research findings indicate that transform domain-based watermarking techniques exhibit greater resilience than spatial domain-based approaches. Therefore, it is recommended to combine multiple frequency domains in the same image watermarking scheme to achieve enhanced security. Besides, to reduce the complexity of model training, pre-trained models are extensively employed, giving rise to challenges such as model overwriting and surrogate model attacks. A pre-trained model is a deep learning model, such as the YOLO model, that has been trained on large datasets and can be used as a starting point for various tasks without training from scratch.

Moreover, since many robust and invisible deep learning-based image watermarking techniques have been proposed in the literature, we can profit from their advantages by adapting them to video watermarking. In fact, an image can easily be generated from a video with a reversible scheme that allows returning to the original video, such as mosaic generation and Krawtchouk matrix generation. By transforming a video into an image with a reversible algorithm, we can apply image watermarking to the obtained image to embed a signature into the video.

Note that the problem with some proposed deep learning-based video watermarking techniques is that they did not focus on testing the robustness of the method against malicious attacks such as collusion (type I and II) which are very dangerous attacks that should be considered when developing a video watermarking technique.

7 Conclusion

This survey presented an overview of deep learning techniques used in watermarking and applied to images and video. First, watermarking terminology was presented, and traditional image and video watermarking techniques were classified based on their embedding domain. Then, deep learning-based image watermarking techniques were classified and compared based on their network architecture. The survey also compared the existing deep learning-based video watermarking techniques proposed recently. Finally, this paper provided suggestions for future research in the field of video watermarking based on deep learning, a promising recent field of research with the potential to revolutionize the protection and security of video communication. In conclusion, deep learning-based watermarking methods have the potential to greatly surpass the capabilities of traditional watermarking techniques in all media and to greatly enhance digital information security.