This section introduces the DeeperForensics-1.0 dataset [41]. The dataset consists of 60,000 videos with 17.6 million frames in total, including 50,000 collected source videos and 10,000 manipulated videos. Toward building a dataset that is suitable for real-world face forgery detection, DeeperForensics-1.0 is designed with careful consideration of quality, scale, and diversity. In Sects. 14.3.1 and 14.3.2, we discuss the details of data collection and our methodology (i.e., DF-VAE) for improving the quality of the data. In Sect. 14.3.3, we present our approaches to increasing the scale and diversity of samples.
Data Collection
Source data is the first factor that strongly affects quality. As the results in Fig. 14.2 illustrate, our dedicated source data collection increases the robustness of our face swapping method to extreme poses, since videos found on the Internet usually have limited head pose variations.
We refer to the identity in the driving video as the “target” face and the identity of the face that is swapped onto the driving video as the “source” face. In contrast to previous works, we find that the source faces play a more critical role than the target faces in building a high-quality dataset. Specifically, the expressions, poses, and lighting conditions of the source faces should be much richer in order to perform robust face swapping. The data collection of DeeperForensics-1.0 therefore focuses mainly on source face videos. Figure 14.3 shows the diversity of the collected source data in different attributes.
We invite 100 paid actors to record the source videos. Similar to [13, 24], we obtain consent from all the actors to use and manipulate their faces, thereby avoiding portrait rights issues. The participants are carefully selected to ensure variability in genders, ages, skin colors, and nationalities. We maintain a roughly equal proportion w.r.t. each of the attributes above. In particular, we invite 55 males and 45 females from 26 countries. Their ages range from 20 to 45 years old to match the most common age group appearing in real-world videos. The actors have four typical skin tones: white, black, yellow, and brown, in a 1:1:1:1 ratio. All faces are clean, without glasses or decorations.
A professional indoor environment is built for a more controllable data collection. We only use the facial regions (detected and cropped by LAB [96]) of the source data; thus, the background is neglected. We set up seven HD cameras at different angles: front, left, left-front, right, right-front, oblique-above, and oblique-below. The recorded videos have a high resolution of \(1920 \times 1080\). The actors are trained in advance to keep the collection process smooth. We request the actors to turn their heads and speak naturally with eight expressions: neutral, angry, happy, sad, surprise, contempt, disgust, and fear. The head poses range from \(-90^{\circ }\) to \(+90^{\circ }\). Furthermore, the actors are asked to perform 53 expressions defined by 3DMM blendshapes [14] (see Fig. 14.4) to supplement some extremely exaggerated expressions. When performing the 3DMM blendshapes, the actors also speak naturally to avoid excessive frames that show a closed mouth.
In addition to expressions and poses, we systematically set nine lighting conditions from various directions: uniform, left, top-left, bottom-left, right, top-right, bottom-right, top, and bottom. The actors are asked to turn their heads only under the uniform illumination, so that the lighting on specific facial regions remains unchanged and duplicated samples recorded by the cameras at different angles are avoided. In total, the collected source data of DeeperForensics-1.0 comprise over 50,000 videos with around 12.6 million frames.
DeepFake Variational Auto-Encoder
To improve the quality of manipulated data in DeeperForensics-1.0, we consider three key requirements in formulating a high-fidelity face swapping method: (1) The method should be generic and scalable to generate a large number of videos with high quality. (2) The problem of face style mismatch caused by the appearance variations should be addressed. Some failure cases in existing datasets are shown in Fig. 14.5. (3) Temporal continuity of generated videos should be taken into consideration.
Based on the aforementioned requirements, we propose the DeepFake Variational Auto-Encoder (DF-VAE), a learning-based face swapping framework. DF-VAE consists of three main parts, namely a structure extraction module, a disentangled module, and a fusion module. The details of the DF-VAE framework are introduced in this section.
Disentanglement of structure and appearance. The first step of the DF-VAE method is face reenactment: animating the source face with an expression similar to that of the target face, without any paired data. Face swapping can be considered a subsequent step of face reenactment that fuses the reenacted face with the target background. For robust and scalable face reenactment, we should disentangle the structure (i.e., expression and pose) and appearance (i.e., texture, skin color, etc.) representations of a face. This disentanglement is difficult since the structure and appearance representations are far from independent.
Let \(\mathbf {x}_{1:T}\equiv {\{x_1,x_2,...,x_T\}}\in {X}\) be a sequence of source face video frames, and \(\mathbf {y}_{1:T}\equiv {\{y_1,y_2,...,y_T\}}\in {Y}\) be the sequence of corresponding target face video frames. We first simplify our problem and only consider two specific snapshots at time \(t\), namely \(x_t\) and \(y_t\). Let \(\tilde{x}_t\), \(\tilde{y}_t\), and \(d_t\) represent the reconstructed source face, the reconstructed target face, and the reenacted face, respectively.
Consider the reconstruction procedure of the source face \(x_t\). Let \(s_x\) denote the structure representation and \(a_x\) denote the appearance information. The face generator can be depicted as the posterior estimate \(p_\theta \left( x_t|s_x,a_x\right) \). The solution of our reconstruction goal, i.e., the marginal log-likelihood \(\tilde{x}_t\sim \log {p_\theta \left( x_t\right) }\), by a common variational auto-encoder (VAE) [50] can be written as follows:
$$\begin{aligned} \begin{aligned} \log {p_\theta \left( x_t\right) }=D_{KL}\left( q_\phi \left( s_x,a_x|x_t\right) \Vert p_\theta \left( s_x,a_x|x_t\right) \right) \\ +L\left( \theta ,\phi ;x_t\right) , \end{aligned} \end{aligned}$$
(14.1)
where \(q_\phi \) is an approximate posterior used to achieve the evidence lower bound (ELBO) in the intractable case, and the second RHS term \(L\left( \theta ,\phi ;x_t\right) \) is the variational lower bound w.r.t. both the variational parameters \(\phi \) and the generative parameters \(\theta \).
In Eq. (14.1), we assume that both \(s_x\) and \(a_x\) are latent variables inferred from the same observation \(x_t\). However, separating these two variables in the latent space is rather difficult without additional conditions. Therefore, DF-VAE employs a simple yet effective approach to disentangle them.
The blue arrows in Fig. 14.6 demonstrate the reconstruction procedure of the source face \(x_t\). Instead of feeding a single source face \(x_t\), we sample another source face \(x^\prime \) to construct unpaired data in the source domain. To make the structure representation more evident, we use the stacked hourglass networks [69] to extract landmarks of \(x_t\) in the structure extraction module and obtain the heatmap \(\hat{x}_t\). Then we feed the heatmap \(\hat{x}_t\) to the Structure Encoder \(E_\alpha \), and \(x^\prime \) to the Appearance Encoder \(E_\beta \). We concatenate the latent representations (small cubes in red and green) and feed them to the Decoder \(D_\gamma \). Finally, we obtain the reconstructed face \(\tilde{x}_t\), i.e., the marginal log-likelihood of \(x_t\).
Therefore, the latent structure representation \(s_x\) in Eq. (14.1) becomes a more evident heatmap representation \(\hat{x}_t\), which is introduced as a new condition. The unpaired sample \(x^\prime \) with the same identity w.r.t. \(x_t\) is another condition, being a substitute for \(a_x\). Equation (14.1) can be rewritten as a conditional log-likelihood:
$$\begin{aligned} \begin{aligned} \log {p_\theta \left( x_t|\hat{x}_t,x^\prime \right) }=D_{KL}\left( q_\phi \left( z_x|x_t,\hat{x}_t,x^\prime \right) \Vert p_\theta \left( z_x|x_t,\hat{x}_t,x^\prime \right) \right) \\ +L\left( \theta ,\phi ;x_t,\hat{x}_t,x^\prime \right) . \end{aligned} \end{aligned}$$
(14.2)
Since the first RHS term, the KL-divergence, is non-negative, we obtain the following:
$$\begin{aligned} \begin{aligned}&\log {p_\theta \left( x_t|\hat{x}_t,x^\prime \right) }\ge {L(\theta ,\phi ;x_t,\hat{x}_t,x^\prime )}\\&=\mathbb {E}_{q_\phi \left( z_x|x_t,\hat{x}_t,x^\prime \right) }\left[ -\log {q_\phi \left( z_x|x_t,\hat{x}_t,x^\prime \right) }+\log {p_\theta \left( x_t,z_x|\hat{x}_t,x^\prime \right) }\right] , \end{aligned} \end{aligned}$$
(14.3)
and \(L(\theta ,\phi ;x_t,\hat{x}_t,x^\prime )\) can also be written as follows:
$$\begin{aligned} \begin{aligned} L\left( \theta ,\phi ;x_t,\hat{x}_t,x^\prime \right) =&-D_{KL}\left( q_\phi \left( z_x|x_t,\hat{x}_t,x^\prime \right) \Vert p_\theta \left( z_x|\hat{x}_t,x^\prime \right) \right) \\&+\mathbb {E}_{q_\phi \left( z_x|x_t,\hat{x}_t,x^\prime \right) }\left[ \log {p_\theta \left( x_t|z_x,\hat{x}_t,x^\prime \right) }\right] . \end{aligned} \end{aligned}$$
(14.4)
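For concreteness, the negative of the lower bound in Eq. (14.4) can be minimized as a standard VAE-style training loss. The following is a minimal sketch, assuming a diagonal Gaussian approximate posterior parameterized by a predicted mean and log-variance, an L1 reconstruction term, and a standard normal in place of the conditional prior \(p_\theta \left( z_x|\hat{x}_t,x^\prime \right) \); these are illustrative choices, not the exact DF-VAE objective.

```python
import torch
import torch.nn.functional as F


def df_vae_loss(x_t, x_recon, mu, logvar, kl_weight=1.0):
    """Negative variational lower bound of Eq. (14.4), up to constants.

    x_t        : original source frame, shape (B, 3, H, W)
    x_recon    : reconstruction produced by the decoder
    mu, logvar : parameters of the diagonal-Gaussian posterior q_phi(z_x | x_t, heatmap, x')
    """
    # Reconstruction term E_q[log p_theta(x_t | z_x, heatmap, x')],
    # approximated here by an L1 loss (an illustrative choice).
    recon = F.l1_loss(x_recon, x_t, reduction="mean")

    # Closed-form KL(q_phi(z_x | ...) || N(0, I)); the standard normal stands in
    # for the conditional prior p_theta(z_x | heatmap, x') in this sketch.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

    return recon + kl_weight * kl
```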
We let the variational approximate posterior be a multivariate Gaussian with a diagonal covariance structure:
$$\begin{aligned} \begin{aligned} \log {q_\phi \left( z_x|x_t,\hat{x}_t,x^\prime \right) }\equiv {\log {\mathcal {N}\left( z_x;\mathbf {\mu },\mathbf {\sigma ^2}\mathbf {I}\right) }}, \end{aligned} \end{aligned}$$
(14.5)
where \(\mathbf {I}\) is an identity matrix. Exploiting the reparameterization trick [50], the non-differentiable sampling operation becomes differentiable through an auxiliary variable with an independent marginal. In this case, \(z_x\sim {q_\phi \left( z_x|x_t,\hat{x}_t,x^\prime \right) }\) is implemented by \(z_x=\mu +\sigma \epsilon \), where \(\epsilon \) is an auxiliary noise variable \(\epsilon \sim \mathcal {N}(0,1)\). Finally, the approximate posterior \(q_\phi (z_x|x_t,\hat{x}_t,x^\prime )\) is estimated by the separated encoders, the Structure Encoder \(E_\alpha \) and the Appearance Encoder \(E_\beta \), in an end-to-end training process using standard gradient descent.
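To make the structure of the disentangled module concrete, the sketch below wires the two separated encoders and the reparameterization trick together in PyTorch. The layer configuration, latent dimensionality, single-channel heatmap, and \(64 \times 64\) frame size are illustrative assumptions rather than the actual DF-VAE architecture.

```python
import torch
import torch.nn as nn


class DisentangledModule(nn.Module):
    """Illustrative two-encoder VAE: the Structure Encoder E_alpha encodes the
    landmark heatmap, the Appearance Encoder E_beta encodes an unpaired frame
    of the same identity, and the Decoder D_gamma reconstructs the face from
    the concatenated latent code."""

    def __init__(self, z_dim=256):
        super().__init__()

        def down(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 4, 2, 1), nn.LeakyReLU(0.2))

        # E_alpha: takes a single-channel heatmap (an assumption for brevity).
        self.enc_struct = nn.Sequential(down(1, 32), down(32, 64), down(64, 128),
                                        nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # E_beta: takes the unpaired RGB frame x' of the same identity.
        self.enc_appear = nn.Sequential(down(3, 32), down(32, 64), down(64, 128),
                                        nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.to_mu = nn.Linear(256, z_dim)
        self.to_logvar = nn.Linear(256, z_dim)
        # D_gamma: decodes the latent code into a 64x64 face image.
        self.dec = nn.Sequential(
            nn.Linear(z_dim, 128 * 8 * 8), nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Sigmoid())

    def forward(self, heatmap, appearance_frame):
        h = torch.cat([self.enc_struct(heatmap),
                       self.enc_appear(appearance_frame)], dim=1)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar
```

At inference time, the same module could encode the heatmap of a target frame together with an appearance frame of the source identity to produce a reenacted face, in line with the procedure described next.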
The above describes the whole workflow of reconstructing the source face. In the target face domain, the reconstruction procedure is the same, as shown by the orange arrows in Fig. 14.6. During training, the network learns structure and appearance information in both the source and the target domains. It is noteworthy that even if both \(y_t\) and \(x^\prime \) belong to arbitrary identities, our disentangled module is capable of learning meaningful structure and appearance information of each identity. During inference, we concatenate the appearance prior of \(x^\prime \) and the structure prior of \(y_t\) (small cubes in red and orange) in the latent space, and the reconstructed face \(d_t\) shares the same structure as \(y_t\) while keeping the appearance of \(x^\prime \). The DF-VAE framework allows the concatenation of structure and appearance latent codes extracted from arbitrary identities at inference time, and thus permits many-to-many face reenactment.
In summary, DF-VAE is a conditional variational auto-encoder [49] with robustness and scalability. It conditions on two posteriors in different domains. In the disentangled module, the separated design of two encoders \(E_\alpha \) and \(E_\beta \), the explicit structure heatmap, and the unpaired data construction jointly force \(E_\alpha \) to learn structure information and \(E_\beta \) to learn appearance information.
Style matching and fusion. To fix the obvious style mismatch problems shown in Fig. 14.5, we adopt a masked adaptive instance normalization (MAdaIN) module in DF-VAE. We place a typical AdaIN [35] network after the reenacted face \(d_t\). In the face swapping scenario, we only need to adjust the style of the face area to match the original background. Therefore, we use a mask \(m_t\) to guide the AdaIN [35] network to focus on style matching of the face area. To avoid boundary artifacts, we apply Gaussian blur to \(m_t\) and obtain the blurred mask \(m_t^b\).
In our face swapping context, \(d_t\) is the content input of MAdaIN, and \(y_t\) is the style input. MAdaIN adaptively computes the affine parameters from the face area of the style input:
$$\begin{aligned} \begin{aligned} \mathrm {MAdaIN}\left( c, s\right) =\sigma \left( s\right) \left( \frac{c-\mu \left( c\right) }{\sigma \left( c\right) }\right) +\mu \left( s\right) , \end{aligned} \end{aligned}$$
(14.6)
where \(c=m_t^b\cdot {d_t}\) and \(s=m_t^b\cdot {y_t}\). With the low-cost MAdaIN module, we reconstruct \(d_t\) again with the Decoder \(D_\delta \). The blurred mask \(m_t^b\) is used again to fuse the reconstructed image with the background of \(y_t\). Finally, we obtain the swapped face \(\overline{d}_t\).
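A possible implementation of the masked AdaIN operation in Eq. (14.6) is sketched below, assuming the mask has already been Gaussian-blurred; the small epsilon added for numerical stability is an assumption.

```python
import torch


def masked_adain(d_t, y_t, m_b, eps=1e-5):
    """Masked adaptive instance normalization, following Eq. (14.6).

    d_t : reenacted face (content input), shape (B, C, H, W)
    y_t : target frame (style input), same shape
    m_b : Gaussian-blurred face mask m_t^b in [0, 1], shape (B, 1, H, W)
    """
    c = m_b * d_t  # content restricted to the face area
    s = m_b * y_t  # style restricted to the face area

    # Per-channel statistics over the spatial dimensions.
    mu_c, sigma_c = c.mean(dim=(2, 3), keepdim=True), c.std(dim=(2, 3), keepdim=True)
    mu_s, sigma_s = s.mean(dim=(2, 3), keepdim=True), s.std(dim=(2, 3), keepdim=True)

    # Normalize the content with its own statistics, then re-scale with the style statistics.
    return sigma_s * (c - mu_c) / (sigma_c + eps) + mu_s
```

The result would then be decoded by \(D_\delta \) and fused with the background of \(y_t\) via the same blurred mask, as described above.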
The MAdaIN module is jointly trained with the disentangled module in an end-to-end manner. Thus, with a single model, DF-VAE can perform many-to-many face swapping with an obvious reduction in style mismatch and facial boundary artifacts (see Fig. 14.7 for face swapping between three source identities and three target identities). Even if there are multiple identities in both the source domain and the target domain, the quality of face swapping does not degrade.
Temporal consistency constraint. Temporal discontinuity in the fake videos generated by certain face manipulation methods leads to obvious flickering of the face area, making them easy to spot by forgery detection methods and human eyes. To improve temporal continuity, DF-VAE lets the disentangled module learn temporal information of both the source face and the target face.
For simplification, we make a Markov assumption that the generation of the frame at time \(t\) sequentially depends on its previous \(P\) frames \(\mathbf {x}_{(t-P):(t-1)}\). We set \(P=1\) to balance quality improvement and training time.
To build the relationship between a current frame and previous ones, we further make an intuitive assumption that the optical flows should remain unchanged after reconstruction. We use FlowNet 2.0 [37] to estimate the optical flow \(\tilde{x}_f\) between \(\tilde{x}_t\) and \(x_{t-1}\), and \(x_f\) between \(x_t\) and \(x_{t-1}\). Since face swapping is sensitive to minor facial details, which can be greatly affected by flow estimation, we do not warp \(x_{t-1}\) by the estimated flow as in [94]. Instead, we minimize the difference between \(\tilde{x}_f\) and \(x_f\) to improve temporal continuity while keeping facial detail generation stable. To this end, we propose a new temporal consistency constraint, which can be written as follows:
$$\begin{aligned} \begin{aligned} L_{temporal}=\frac{1}{CHW}\Vert \tilde{x}_f-x_f\Vert _1, \end{aligned} \end{aligned}$$
(14.7)
where \(C=2\) for a common form of optical flow.
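With the flows estimated by an external network such as FlowNet 2.0, Eq. (14.7) reduces to a mean absolute difference between two flow fields. A minimal sketch, assuming flow tensors of shape (B, 2, H, W), is given below.

```python
import torch


def temporal_consistency_loss(flow_recon, flow_orig):
    """Eq. (14.7): L1 distance between the flow estimated on the reconstructed
    pair (x~_t, x_{t-1}) and the flow estimated on the original pair
    (x_t, x_{t-1}), normalized by C * H * W.

    flow_recon, flow_orig : optical flow tensors of shape (B, 2, H, W).
    """
    # torch.mean averages over all elements, i.e., it divides by B * C * H * W,
    # which matches the per-sample 1 / (C * H * W) normalization of Eq. (14.7).
    return torch.mean(torch.abs(flow_recon - flow_orig))
```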
We only discuss the temporal continuity w.r.t. the source face in this section. The case of the target face is the same. If multiple identities exist in one domain, temporal information of all these identities can be learned in an end-to-end manner.
Scale and Diversity
The extensive data collection and the introduced DF-VAE method are designed to improve the quality of manipulated videos in the DeeperForensics-1.0 dataset. In this section, we mainly discuss the scale and diversity aspects.
The DeeperForensics-1.0 dataset contains 10,000 manipulated videos with 5 million frames. We take 1,000 refined YouTube videos collected by FaceForensics++ [81] as the target videos. Each face of our collected 100 identities is swapped onto 10 target videos; thus, 1,000 raw manipulated videos are generated directly by DF-VAE in an end-to-end process. Thanks to the scalability and multimodality of DF-VAE, the time overhead of model training and data generation is reduced to one-fifth of that of common DeepFakes methods, with no degradation in quality. This makes the construction of a larger scale dataset possible.
To enhance diversity, we apply various perturbations that exist in real scenes. Specifically, as shown in Fig. 14.8, seven types of distortions defined in Image Quality Assessment (IQA) [58, 77] are included. Each distortion is divided into five intensity levels. We apply random-type distortions to the 1,000 raw manipulated videos at five different intensity levels, producing a total of 5,000 manipulated videos. In addition, 1,000 robust manipulated videos are generated by adding random-type, random-level distortions to the 1,000 raw manipulated videos. Moreover, in contrast to other datasets [13, 51, 57, 81, 99], each sample of another 3,000 manipulated videos in DeeperForensics-1.0 is subjected to a mixture of more than one distortion (examples shown in Fig. 14.8). The variety of perturbations improves the diversity of DeeperForensics-1.0, better approximating the data distribution of real-world scenarios.
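As an illustration of how random-type, random-level perturbations might be applied, the sketch below draws a distortion type and an intensity level at random for each frame. The distortion functions and level parameters here are hypothetical stand-ins, not the exact IQA distortion settings used to build the dataset.

```python
import random

import cv2
import numpy as np


# Hypothetical distortion set; intensity levels range from 1 (mild) to 5 (severe).
def gaussian_blur(img, level):
    k = 2 * level + 1
    return cv2.GaussianBlur(img, (k, k), 0)


def gaussian_noise(img, level):
    noisy = img.astype(np.float32) + np.random.normal(0, 5 * level, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)


def jpeg_compression(img, level):
    _, buf = cv2.imencode(".jpg", img, [int(cv2.IMWRITE_JPEG_QUALITY), 90 - 15 * level])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)


DISTORTIONS = [gaussian_blur, gaussian_noise, jpeg_compression]


def perturb_frame(frame, num_distortions=1):
    """Apply random-type, random-level distortions (or a mixture of several)."""
    for distortion in random.sample(DISTORTIONS, k=num_distortions):
        frame = distortion(frame, level=random.randint(1, 5))
    return frame
```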
Hidden Test Set
Several existing benchmarks [57, 81] have demonstrated high-accuracy face forgery detection results on their proposed datasets. However, the sources and imposed distortions of DeepFakes videos are much more variable and unpredictable in real-world scenarios. Because the training and test sets of these datasets share a very similar distribution, large biases are introduced, and the actual efficacy of these studies [57, 81] in detecting real-world face forgeries remains to be further elucidated.
An indispensable component of DeeperForensics-1.0 is its hidden test set, which is richer in distribution than the publicly available training set. The hidden test set reflects a more realistic face forgery detection setting: (1) Multiple sources. Fake videos in the wild may be manipulated by different unknown methods; (2) High quality. Threatening fake videos should have high enough quality to deceive human eyes; (3) Diverse distortions. Different perturbations should be taken into consideration. The ground truth labels are hidden and are used on the host server to evaluate the accuracy of detection models. The hidden test set will evolve to include more challenging samples along with the development of DeepFakes technology.
Overall, DeeperForensics-1.0 is a new large-scale dataset consisting of over 60,000 videos with 17.6 million frames for real-world face forgery detection. High-quality source videos and manipulated videos constitute two main contributions of this dataset. The high-diversity perturbations applied to the manipulated videos enhance the ability of DeeperForensics-1.0 to simulate real scenes. The dataset has been released, free to all research communities, for developing face forgery detection and more general human-face-related research.