Introduction

Facial animation generation is a highly challenging research problem with several applications, such as facial animation for digital humans, computer games, movies, and immersive VR telepresence [32]. In these applications, facial animations must exhibit a high level of naturalness and plausibility, ensuring intelligibility on par with actual human speakers. The human visual system is evolutionarily attuned to perceiving nuanced facial movements and expressions; consequently, animations lacking natural expressions or synchronization with lip movements can be distressing for viewers.

In recent years, deep learning methods have made significant progress in various application areas [26]. Currently, speech-driven methods can generate realistic lip-synced 3D facial animations by training on 4D facial audiovisual datasets. However, speech-driven 3D facial animation neglects the generation of head pose and facial expressions, because speech is only weakly correlated with facial expressions and head pose. This limitation is further exacerbated by the scarcity of 4D facial datasets, leading to static upper-face animation [10, 12, 25]. Although some methods [32] can generate random eye blinking or eyebrow motion when trained on high-precision datasets that are not publicly available, or transfer three common emotions (happiness, anger, surprise) to 3D faces through an emotion transfer network [38], they lack deeper control over expression and pose. Existing studies have shown that it is difficult to obtain lip-synced, naturally expressive and pose-controllable 3D facial animation from the speech modality alone.

The task of video-driven 3D facial animation is similar to that of single-image 3D face reconstruction, aiming to convert 2D images into 3D models. In recent years, significant progress has been made in single-image 3D face reconstruction. The most advanced single-image 3D face reconstruction method, DECA [13], can predict camera, lighting, shape, texture, expression and pose parameters from a single photo fairly accurately. However, when 3D face reconstruction is applied directly to video-driven tasks, the reconstructed mouth shape and motion often exhibit artifacts that are clearly perceptible to humans, and the reconstruction struggles to capture the lip movements corresponding to speech. To address these limitations, EMOCA [11] added an expression network on top of DECA to regress expression parameters. Using a well-trained emotion recognition model, it computes an emotion consistency loss (also known as an emotion perceptual loss) on the predictions of the expression network, which yields better expressions. However, EMOCA is trained on single-image data; when it is applied to video data to generate 3D facial animation, issues remain such as non-smooth animation and mismatch between mouth shape and speech. The state-of-the-art method, SPECTRE [14], improved on DECA and EMOCA by replacing EMOCA's expression network with a perceptual encoding network that predicts both facial expression and jaw pose parameters. In addition, it uses a well-trained state-of-the-art lip-reading model to apply a lip-reading consistency loss (also known as a speech perceptual loss) to the predictions of the perceptual encoding network. With this approach, the reconstructed face has more accurate mouth movements and, when combined with the corresponding speech, produces more realistic results. Video-based (performance-driven) 3D facial animation methods can conveniently and accurately generate expressions and poses from visual information, but compared to speech-driven 3D facial animation trained on 4D facial audiovisual datasets, they still have an inherent disadvantage in perceived lip shape.

Therefore, it is of great research significance to combine speech-driven 3D facial lip animation trained on 4D facial audiovisual datasets with video-driven 3D facial animation trained on 2D videos, retaining the advantages of both, in order to obtain lip-synced, naturally expressive and pose-controllable 3D facial animation. This paper proposes a dual-modal generation method that uses speech and video information to generate more natural and vivid 3D facial animation, focusing on the mouth area as well as speech-uncorrelated expressions and poses. The main contributions of this paper are as follows:

  (a) Building an additional expression and pose network on top of a speech-driven network trained on a 4D face dataset. The speech-driven network extracts speech features and generates basic lip animation, while the expression and pose network extracts temporal visual features to regress facial expression and head pose parameters. By fusing the speech and visual features, the chin pose parameters associated with lip movements are obtained; these parameters are then used to fine-tune the lip animation generated by the speech-driven network.

  (b) Designing a new video frame preprocessing algorithm that crops all frames of a video uniformly. This makes it easier for the expression and pose network to learn the temporal information between frames and the transformation of the face within a consistent background. Experiments verify the effectiveness of this preprocessing algorithm, which improves the precision of the network's predictions.

  (c) Designing a "head pose consistency" loss that guides the network to reconstruct more accurate head poses and mitigates prediction errors in extreme head pose situations.

  (d) Conducting extensive objective and subjective (user study) evaluations to demonstrate the superiority of our method. The effectiveness of each component of the method is verified through ablation experiments.

The rest of this paper proceeds as follows: in “Related work”, we provide a comprehensive review of previous studies related to speech-driven, video-driven and speech-video driven 3D facial animation. In “Design of algorithm”, we present the details of the proposed framework and provide a detailed illustration of its implementation. In “Experiments”, we demonstrate the performance of our new method, describe the experimental settings used, and present our experimental results. In “Ablation study”, we present the ablation experimental setting and our results. The conclusion and future work are given in “Conclusions”.

Related work

The generation of 3D facial animations has always been a challenging problem that has garnered significant attention in the fields of computer graphics and computer vision, leading to extensive research. Based on the different driving methods, facial animation can be categorized into text-driven [42], speech-driven [10, 12, 25, 32, 43], video-driven [11, 13, 14] and speech-video combined driven [8, 21] animations.

In the following, we will introduce the methods for generating 3D facial animations driven by speech, video and both speech and video that are most relevant to this paper.

Speech-driven 3D facial animation

In the fields of computer graphics and computer vision, speech-driven 3D facial animation aims to drive a 3D facial model to generate lip animation matching the speech input. In recent years, speech-driven 3D facial animation based on deep learning has been extensively researched. For example, Richard et al. [31] used a fully speech-driven method to achieve real-time, realistic facial animation, but it is personalized and relies on hours of training data from a single subject. Taylor et al. [35] proposed a sliding-window method that takes phoneme sequences transcribed from audio as input and uses retargeting techniques to redirect the output to other animation platforms. Karras et al. [22] designed an end-to-end convolutional neural network that encodes speech and uses a latent code to resolve ambiguity in facial expression changes. However, this model has low-fidelity lip synchronization and facial expressions, and cannot generalize to new characters. Zhou et al. [43] adopted a three-stage network combining phoneme clusters, facial landmarks and speech features to predict viseme animation curves.

Tian et al. [36] proposed a method based on deep bidirectional LSTM networks and attention mechanisms to map input speech features to cartoon facial animation parameters. However, their mapping from speech to the face may not preserve the identity and personality of the target speaker, especially when dealing with new speakers or sentences.

Cudeiro et al. [10] created the 4D face dataset VOCASET and proposed a subject-independent 3D facial animation method called VOCA. The model extracts speech features using the DeepSpeech network [18] and feeds them into an encoder-decoder network that incorporates one-hot identity encoding in the decoding stage to capture different speaking styles. Finally, it outputs vertex offsets relative to a template 3D face to generate 3D facial animation. However, the generated facial motion is mostly confined to the lower half of the face.

Liu et al. [25] considered the geometric representation of 3D models based on VOCA and used dense connections to combine features learned at different stages. They extracted richer implicit information from speech and selectively emphasized important features through added attention layers, yielding a more effective neural network, GDPnet, for learning the complex nonlinear relationship between audiovisual signals and producing more realistic animation results. However, it still shares the same problem as VOCA.

Richard et al. [32] proposed the MeshTalk model, which learns facial animation synthesis with a categorical latent space. They employed a cross-modality loss to decouple the upper and lower facial regions, successfully separating audio-correlated from audio-uncorrelated facial motion. This yields highly realistic full-face animation, including random eye blinking and eyebrow raising. However, it requires a large amount of high-fidelity 3D facial data to ensure animation quality and generalization to unseen identities.

Wang et al. [38] proposed a new deep neural-network-based speech-driven facial animation model called 3D-TalkEmo, which allows customization of the speaker's emotional state through an emotion transfer network. However, it was trained on a 3D facial dataset reconstructed from monocular videos, which may lead to unreliable results.

Fan et al. [12] proposed a Transformer-based autoregressive model called FaceFormer, which encodes long-term speech context and predicts temporally stable 3D facial animation using self-attention. However, self-attention brings quadratic space and time complexity, making the model more complex and computationally expensive than others, and problems with expression and pose remain.

Video-driven 3D facial animation

Video-driven 3D facial animation aims to measure human facial expressions and poses from monocular video alone and apply them to target models. Researchers have been actively working in this field over the past few years to achieve accurate and expressive facial animation. For example, Cao et al. [7] proposed a regression method for real-time 3D facial capture with an ordinary camera, achieving robustness and accuracy comparable to RGB-D-based algorithms. However, this method requires training and calibration with user-specific data, which greatly limits its applicability. Later, Cao et al. [6] proposed a universal regressor learned from public image datasets for reconstructing 3D facial shapes from video frames. Barros et al. [3] proposed a lightweight method for generating facial animation from monocular video; it estimates rigid head pose and non-rigid facial deformation by detecting and tracking 2D facial landmarks, and transfers facial expressions from 2D images to 3D virtual characters. Laine et al. [23] proposed a real-time deep learning framework for video-based facial performance capture, i.e., given monocular video, dense 3D tracking of the actor's face is achieved. Daněček et al. [11] added an expression perceptual loss to the state-of-the-art single-image 3D face reconstruction network DECA [13]; their method, EMOCA, captures finer and more extreme expressions by measuring the similarity between emotion features extracted from the original photo and from the rendered image of the reconstructed 3D face model. However, it ignores temporal information between video frames and lacks lip shape perception. Furthermore, Filntisis et al. [14] added a lip shape perceptual loss and considered temporal information between video frames on top of DECA and EMOCA. They trained their model, SPECTRE, on the publicly available audiovisual lip-reading dataset LRS3 [2] and achieved better lip shapes and more realistic 3D facial animation by using the similarity between lip-reading features extracted from the original photo and from the rendered image of the reconstructed 3D face model as a lip perceptual loss. However, they only used DECA's head pose parameters for head control and ignored speech features. Overall, current methods for video-driven 3D facial animation have been widely explored and can achieve decent results, but due to the loss of information when lifting 2D video to 3D, video-based facial animation remains a challenging problem.

Speech-video driven 3D facial animation

The aim of speech-video driven 3D facial animation is to achieve better results than either single-modal approach by fusing speech and visual information. In many tasks, multimodal approaches have received increasing attention. For example, Xu et al. [41] proposed a dual-modal emotion recognition framework consisting of a parallel convolution (Pconv) module and an attention-based bidirectional long short-term memory (BLSTM) module. Among traditional methods, coupled hidden Markov models (CHMM) [4] can be used for speech-video driven 3D facial animation; a CHMM explicitly models the asynchrony between speech and lip shape using cross-time and cross-chain conditional probabilities [1, 39]. To decode the state sequence of visual parameters, the works [9, 15] use Baum-Welch HMM inversion instead of the commonly used Viterbi decoding, resulting in more accurate animation control. However, an HMM only allows a single hidden state to occupy each time step, meaning that many states are needed to model multimodal signals, and it cannot capture the complexity of cross-modal dynamics. Xie et al. [40] overcame this problem by using dynamic Bayesian networks (DBN) and Baum-Welch DBN inversion to model cross-modal dependencies and generate animation from speech. Liu et al. [27] introduced a unit-selection-based system and applied dynamic programming to select candidate sequences for each input frame from a pre-collected audiovisual database. Candidate frames are selected based on the weighted sum of the distances between the inferred speech and visual frames and the candidate frames from the database, with weights derived from manually crafted reliability measures of the speech and video streams. In recent years, with the rapid development of deep learning and multimodal fusion, researchers have begun to focus on deep learning-based multimodal methods. For example, Xin et al. [8] introduced an audio-video joint-driven 3D facial animation system. It incorporates a large-vocabulary continuous speech recognition (LVCSR) system to align phonemes and employs a knowledge-guided mapping from each phoneme to 3D blendshapes, and it improves the quality of the generated facial animation by integrating both speech and video information. However, this method requires scanning a 3D face model to create the phoneme-to-blendshape mapping. Hussen et al. [21] proposed a neural-network-based method that uses audiovisual data to drive 3D facial animation. The network extracts audio embeddings from speech spectrogram features and visual embeddings from facial images; after fusing the two, it regresses speech-related facial controls with affine layers, while non-linguistic facial controls and head pose are inferred from the visual embeddings alone. The disadvantage of this method is that it considers the temporal information of speech but ignores the temporal information, and thus the rich dynamics, of the video.

Design of algorithm

Fig. 1 Preprocessing workflow. The video preprocessing process includes face landmark detection, face cropping and normalization

Modeling

As discussed above, our goal is to generate lip-synced, naturally expressive and pose-controllable 3D facial animation. Let A be the raw audio data and \({\textbf{F}}_{1:T} = ({\textbf{f}}_1,\ldots , {\textbf{f}}_T)\) be the video data, where T is the number of video frames. A pre-trained speech-driven network takes the raw audio data A as input and outputs predicted 3D facial lip offsets \(\tilde{{\textbf{Y}}}_{1:T^{\prime }}=(\tilde{{\textbf{y}}}_1, \ldots , \tilde{{\textbf{y}}}_{T^{\prime }})\), and the speech encoder in the network outputs speech features \({\textbf{W}}_{1:T^{\prime }} =({\textbf{w}}_1,\ldots , {\textbf{w}}_{T^{\prime }})\), where \(T^{\prime }\) is the number of frames output by the speech-driven network. In this paper, we construct an expression pose module, whose backbone network extracts visual features \(\mathbf {V^{\prime }}_{1:T} = (\mathbf {v^{\prime }}_1,\ldots , \mathbf {v^{\prime }}_T)\) from the video data \({\textbf{F}}_{1:T}\), and a temporal convolutional network further extracts temporal visual features \({\textbf{V}}_{1:T} = ({\textbf{v}}_1,\ldots , {\textbf{v}}_T)\). The temporal visual features \({\textbf{V}}_{1:T}\) are used to regress the purely visual expression parameters \(\psi \) and head pose parameters \(\theta _p\). By aligning and fusing the temporal visual features \({\textbf{V}}_{1:T}\) and speech features \({\textbf{W}}_{1:T^{\prime }}\), the chin pose parameters \(\theta _j\) related to lip shape are regressed. Here, \(\theta _j\) is treated as a fine-tuning of the lip animation \(\tilde{{\textbf{Y}}}_{1:T^{\prime }}\) generated by the speech-driven network, and is constrained by L2 regularization. The expression parameters \(\psi \), head pose parameters \(\theta _p\) and chin pose parameters \(\theta _j\) are decoded by the FLAME [24] face model into 3D facial offsets \(\overline{{\textbf{Y}}}_{1:T}=(\overline{{\textbf{y}}}_1, \ldots , \overline{{\textbf{y}}}_{T})\). The resampled \(\tilde{{\textbf{Y}}}_{1:T^{\prime }}\) is aligned with \(\overline{{\textbf{Y}}}_{1:T}\), and both are added to the 3D face template to obtain lip-synced, pose-controllable and naturally expressive 3D facial animation.
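To make the final assembly step concrete, the following is a minimal PyTorch sketch of adding the two offset streams to the template mesh; the tensor shapes, the 5023-vertex FLAME topology and the offset magnitudes are illustrative assumptions, not the authors' released code.

```python
import torch

# Assumed shapes: T aligned frames, V = 5023 FLAME vertices.
T, V = 100, 5023
template = torch.zeros(V, 3)                    # neutral 3D face template
lip_offsets = torch.randn(T, V, 3) * 1e-3       # resampled speech-driven lip offsets (Y~)
visual_offsets = torch.randn(T, V, 3) * 1e-3    # FLAME-decoded expression/pose offsets (Y-)

# Per-frame animated vertices: template plus both offset streams.
animated_vertices = template.unsqueeze(0) + lip_offsets + visual_offsets  # (T, V, 3)
```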

Preprocessing

To obtain inputs suitable for the expression pose network, it is necessary to preprocess the video frames. The preprocessing workflow, as shown in Fig. 1, includes 2D facial landmark detection, face cropping and normalization.

The preprocessing method for video frames is shown in Algorithm 1. First, 68 facial landmarks are detected in every video frame using the open-source FAN [5] face alignment network. Second, face cropping is performed on all video frames using an improved face bounding box cropping algorithm that considers the entire video sequence: the maximum cropping range is computed from the outermost facial landmark positions across all frames, so that a single cropping box covers all faces in the frame sequence. This makes it easier for the temporal convolutional network to learn the temporal information between frames and the transformation of the face within a consistent background. We validated the effectiveness of this preprocessing algorithm in subsequent experiments, which showed improved prediction accuracy of the network model. To better cover the entire head region, random scaling is applied to the original face cropping box. The scaling factor is typically set between 1.4 and 1.6 times the original size, and using a random scaling factor improves the generalization ability of the network. Specifically, the center coordinate and original size of the face cropping box are computed, the new size is obtained by multiplying the original size by the scaling factor, and a new face cropping box is constructed from the center coordinate and new size; the face is then cropped to a fixed size of \(s \times s\). Finally, the cropped face images are normalized by converting pixel values from 0-255 to 0-1 to accelerate the convergence of the neural network.

Algorithm 1 Video frame preprocessing algorithm
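Below is a minimal sketch of the uniform-cropping idea described above, assuming the 68 landmarks per frame have already been detected (e.g., with FAN) and that OpenCV and NumPy are available; the function name and the border-padding strategy are our own illustrative choices, not a reproduction of Algorithm 1.

```python
import random
import cv2
import numpy as np

def uniform_crop(frames, landmarks, out_size=224, scale_range=(1.4, 1.6)):
    """Crop every frame of one video with a single shared bounding box.

    frames:    list of H x W x 3 uint8 images
    landmarks: list of (68, 2) arrays of 2D facial landmarks (e.g. from FAN)
    """
    pts = np.concatenate(landmarks, axis=0)
    # Maximum extent of the face over the whole sequence.
    left, top = pts.min(axis=0)
    right, bottom = pts.max(axis=0)
    cx, cy = (left + right) / 2.0, (top + bottom) / 2.0
    size = max(right - left, bottom - top)
    # Random enlargement of the shared box improves generalization.
    size = int(size * random.uniform(*scale_range))

    cropped = []
    for img in frames:
        x0, y0 = int(cx - size / 2), int(cy - size / 2)
        x1, y1 = x0 + size, y0 + size
        # Pad if the box leaves the image, then crop and resize.
        pad_x0, pad_y0 = max(0, -x0), max(0, -y0)
        pad_x1, pad_y1 = max(0, x1 - img.shape[1]), max(0, y1 - img.shape[0])
        img = cv2.copyMakeBorder(img, pad_y0, pad_y1, pad_x0, pad_x1, cv2.BORDER_CONSTANT)
        crop = img[y0 + pad_y0:y1 + pad_y0, x0 + pad_x0:x1 + pad_x0]
        crop = cv2.resize(crop, (out_size, out_size))
        cropped.append(crop.astype(np.float32) / 255.0)  # normalize to [0, 1]
    return np.stack(cropped)
```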

Fig. 2 System overview. The network model consists of a fixed speech-driven network and a facial reconstruction network, with an additional expression pose network built to better predict expression and pose parameters

Network architecture

The system overview of the proposed speech-video joint-driven 3D facial animation generation method is shown in Fig. 2. The method uses a pre-trained DECA [13] network and a pre-trained speech-driven network as the foundation for training our own network, which allows us to leverage the existing capabilities of DECA while integrating our own enhancements. With the speech-driven network and the face reconstruction network fixed, an additional expression pose network is built and trained to better predict the expression and pose parameters of the 3D face model. The speech-driven network takes audio as input and outputs the lip offsets of the 3D face, which serve as the basis of the 3D facial animation; the speech features produced by its encoder are fed into the expression pose network for feature fusion. The expression pose network uses the fused dual-modal features to predict expression and pose parameters, adding expressions and poses to the basic animation generated by the speech-driven network and fine-tuning the lip animation. During training, the shape parameters predicted by the 3D face reconstruction network are combined with the expression and pose parameters predicted by the expression pose network, reconstructed into a 3D face using the FLAME face decoder, and geometrically constrained using facial landmarks. Finally, the predicted 3D face model sequence is rendered into a 2D video using differentiable rendering, combining the texture and camera transformation parameters predicted by the 3D face reconstruction network. It should be pointed out that the overall framework depends on other pre-trained network models, and their accuracy affects the accuracy of the expression pose network.

Inspired by EMOCA [11] and SPECTRE [14], we introduced pre-trained emotion recognition, pose estimation and lip reading networks to calculate the expression consistency loss, pose consistency loss and lip reading consistency loss (also known as perceptual losses) of the expression pose network. The emotion recognition network comes from a pre-trained model provided by EMOCA, the pose estimation network comes from a pre-trained model provided by Hempel et al. [20], and the lip reading network comes from a pre-trained model provided by Ma et al. [28]. By separately inputting the original video sequence and the rendered video of the predicted 3D face animation into the pre-trained perception models mentioned above, corresponding feature vectors can be obtained. In theory, the rendered video and the original videos should be consistent in expression, pose and lip shape. Therefore, by minimizing the distance between the feature vectors of the rendered videos and the original videos, we can optimize the output of the expression pose network.

Fig. 3 The network architecture of our model. The input for the model includes both speech and video data. The fused speech and video features are used to regress the chin pose parameters related to lip shape, while the visual features alone are used to regress the expression parameters and head pose parameters

The detailed structure of the network model is shown in Fig. 3. The input consists of the speech input A and the video input \({\textbf{F}}_{1:T} = ({\textbf{f}}_1,\ldots , {\textbf{f}}_T)\), where T is the number of video frames. The raw speech data A are fed into the speech-driven network, whose encoder outputs the speech features \({\textbf{W}}_{1:T^{\prime }} =({\textbf{w}}_1,\ldots , {\textbf{w}}_{T^{\prime }})\). The predicted 3D face vertex offsets are denoted by \(\tilde{{\textbf{Y}}}_{1:T^{\prime }}=(\tilde{{\textbf{y}}}_1, \ldots , \tilde{{\textbf{y}}}_{T^{\prime }})\), where \(T^{\prime }\) is the number of frames output by the speech-driven network. Note that these 3D face vertex offsets are related only to lip movements, i.e., they are independent of expression and head pose. To align with the video frames, we resample \({\textbf{W}}_{1:T^{\prime }}\) using linear interpolation to obtain \({\textbf{W}}_{1:T}=({\textbf{w}}_1,\ldots , {\textbf{w}}_T)\), and resample \(\tilde{{\textbf{Y}}}_{1:T^{\prime }}\) to obtain \(\tilde{{\textbf{Y}}}_{1:T}=(\tilde{{\textbf{y}}}_1, \ldots , \tilde{{\textbf{y}}}_{T})\).
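A minimal sketch of this temporal alignment, assuming the speech features and lip offsets are held as PyTorch tensors; the helper name resample_time and the example dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

def resample_time(x, target_len):
    """Linearly resample a (T', D) feature sequence to (target_len, D)."""
    x = x.t().unsqueeze(0)                                    # interpolate expects (batch, channels, length)
    x = F.interpolate(x, size=target_len, mode="linear", align_corners=True)
    return x.squeeze(0).t()                                   # (target_len, D)

# Example: align 120 audio-rate frames of speech features to 100 video frames.
W_audio = torch.randn(120, 64)            # W_{1:T'}
W_aligned = resample_time(W_audio, 100)   # W_{1:T}

# Lip offsets (T', V, 3) can be flattened to (T', V*3), resampled, and reshaped back.
Y_audio = torch.randn(120, 5023, 3)
Y_aligned = resample_time(Y_audio.reshape(120, -1), 100).reshape(100, 5023, 3)
```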

The video data \({\textbf{F}}_{1:T}\) are fed into the expression and pose network, which uses MobileNet v2 as the backbone to extract per-frame visual features \(\mathbf {V^{\prime }}_{1:T} = (\mathbf {v^{\prime }}_1,\ldots , \mathbf {v^{\prime }}_T)\). To exploit the temporal information between video frames and learn the rich dynamics of the video, a one-dimensional convolutional layer with kernel size 5, stride 1 and padding 2 forms the temporal convolution layer of the expression and pose network, which further extracts temporal visual features \({\textbf{V}}_{1:T} = ({\textbf{v}}_1,\ldots , {\textbf{v}}_T)\) from \(\mathbf {V^{\prime }}_{1:T}\). We then fuse the speech features \({\textbf{W}}_{1:T}\) and the temporal visual features \({\textbf{V}}_{1:T}\) by simple concatenation. The fused features are passed through two fully connected layers to regress the chin pose parameters \(\theta _j\) related to lip movements, while the temporal visual features \({\textbf{V}}_{1:T}\) are passed through one fully connected layer to regress the expression parameters \(\psi \) and head pose parameters \(\theta _p\), which are related only to the visual information. Here, \(\theta _j\) can be regarded as a fine-tuning of \(\tilde{{\textbf{Y}}}_{1:T}\), constrained by L2 regularization. Suppose that, ignoring expressions and head poses, the true 3D face lip vertex offsets corresponding to the video data are \({\textbf{Y}}_{1:T}=({\textbf{y}}_1,\ldots , {\textbf{y}}_T)\); then the output \(\tilde{{\textbf{Y}}}_{1:T}\) of the speech-driven network is close to \({\textbf{Y}}_{1:T}\) but carries an error \(\Delta {{\textbf{Y}}}_{1:T} = {\textbf{Y}}_{1:T} - \tilde{{\textbf{Y}}}_{1:T}\). This error is influenced by the speaking styles of different people (differences in lip opening) and by the accuracy of the speech-driven network. We expect the chin pose parameters \(\theta _j\), decoded through the FLAME face model, to yield 3D face vertex offsets \(\Delta \tilde{{\textbf{Y}}}_{1:T}\) close to \(\Delta {{\textbf{Y}}}_{1:T}\), i.e., \({\textbf{Y}}_{1:T} = \tilde{{\textbf{Y}}}_{1:T} + \Delta \tilde{{\textbf{Y}}}_{1:T}\), so that the predicted chin pose parameters \(\theta _j\) learn the residual between the prediction of the speech-driven network and the ground truth. Since this error is very small, L2 regularization is used to constrain the magnitude of \(\theta _j\). Finally, the expression parameters \(\psi \), head pose parameters \(\theta _p\) and chin pose parameters \(\theta _j\) are decoded with the FLAME face model into 3D face vertex offsets \(\overline{{\textbf{Y}}}_{1:T}=(\overline{{\textbf{y}}}_1, \ldots , \overline{{\textbf{y}}}_{T})\) containing expression and pose information.
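The following sketch shows how such an expression pose network could be assembled in PyTorch; the use of torchvision's MobileNetV2, the feature dimensions, and the output sizes (50 expression coefficients and 3-dimensional head and jaw rotations) are assumptions for illustration rather than the exact implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class ExpressionPoseNet(nn.Module):
    """Sketch: MobileNetV2 backbone + temporal Conv1d + parameter heads."""

    def __init__(self, speech_dim=64, n_exp=50):
        super().__init__()
        self.backbone = mobilenet_v2(weights=None).features   # per-frame visual features
        self.pool = nn.AdaptiveAvgPool2d(1)
        vis_dim = 1280
        # Temporal convolution over the frame axis (kernel 5, stride 1, padding 2).
        self.temporal = nn.Conv1d(vis_dim, vis_dim, kernel_size=5, stride=1, padding=2)
        # Visual-only head: expression psi and head pose theta_p.
        self.exp_pose_head = nn.Linear(vis_dim, n_exp + 3)
        # Fused (visual + speech) head: chin/jaw pose theta_j via two FC layers.
        self.jaw_head = nn.Sequential(
            nn.Linear(vis_dim + speech_dim, 256), nn.ReLU(), nn.Linear(256, 3))

    def forward(self, frames, speech_feats):
        # frames: (B, T, 3, H, W); speech_feats: (B, T, speech_dim), already aligned to T.
        B, T = frames.shape[:2]
        v = self.backbone(frames.flatten(0, 1))               # (B*T, 1280, h, w)
        v = self.pool(v).flatten(1).view(B, T, -1)            # V'_{1:T}
        v = self.temporal(v.transpose(1, 2)).transpose(1, 2)  # V_{1:T}, temporal visual features
        exp_pose = self.exp_pose_head(v)                      # (B, T, n_exp + 3)
        psi, theta_p = exp_pose[..., :-3], exp_pose[..., -3:]
        theta_j = self.jaw_head(torch.cat([v, speech_feats], dim=-1))  # (B, T, 3)
        return psi, theta_p, theta_j

# Toy forward pass with 4 frames of 224x224 input.
net = ExpressionPoseNet()
psi, theta_p, theta_j = net(torch.randn(1, 4, 3, 224, 224), torch.randn(1, 4, 64))
```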

During training, this paper combines the parameter predictions of the 3D face reconstruction network DECA. These predictions are used, through FLAME face decoding, to generate a 3D face model that matches the shape, pose and expression of the face in the video. The expression and pose network is optimized using a geometric constraint loss and multiple consistency losses. However, due to GPU memory limitations during training, feeding all frames of a video into the network at once would exceed the CUDA memory capacity; therefore, each training iteration uses only K consecutively sampled frames from the video, with a random starting position. During inference, adding \(\tilde{{\textbf{Y}}}_{1:T}\) and \(\overline{{\textbf{Y}}}_{1:T}\) to the 3D face template \(\overline{{\textbf{T}}}\) yields a lip-synced, pose-controllable and naturally expressive 3D facial animation.
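A minimal sketch of the random contiguous-frame sampling used to keep training within GPU memory; the data shapes are illustrative.

```python
import random
import torch

def sample_clip(frames, landmarks, K=20):
    """Randomly pick K consecutive frames (and their landmarks) from one video."""
    T = frames.shape[0]
    start = random.randint(0, max(0, T - K))   # random starting position
    end = min(start + K, T)
    return frames[start:end], landmarks[start:end]

# Usage: frames (T, 3, 224, 224) and landmarks (T, 68, 2) of one training video.
clip, lmk = sample_clip(torch.randn(130, 3, 224, 224), torch.randn(130, 68, 2))
```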

Loss function

To train the facial expression and pose network, we employ three consistency loss functions and a geometric constraint loss to guide the network to reconstruct 3D facial animations of expressions and poses.

Consistency losses

The expression and pose parameters, which are predicted by the expression-pose network, are combined with the shape, albedo, camera and light parameters generated by the 3D face reconstruction network. Additionally, the 3D face lip shape offsets, predicted by the speech-driven network, are utilized. These parameters collectively contribute to rendering 2D videos representing the anticipated 3D facial animation sequences that correspond to the original input video. As mentioned earlier, we introduce three pre-trained task-specific models that use the original video and rendered video to obtain their respective feature vectors. By minimizing the distance between the feature vectors of the rendered video and the original video, we can better guide the expression-pose network to generate 3D facial animation. The three consistency loss functions will be introduced below.

Emotion consistency loss The emotion recognition network is based on the pre-trained model provided by EMOCA [11], with ResNet50 as the backbone network. The original video and rendered video are fed into the pre-trained emotion recognition network, obtaining the emotion features \(\varvec{\epsilon }_V=E(V)\) of each frame of the original video and the emotion features \(\varvec{\epsilon }_{R}=E(V_{R})\) of each frame of the rendered video. The emotion consistency loss calculates the difference between the emotion features \(\varvec{\epsilon }_V\) and \(\varvec{\epsilon }_{R}\) using mean squared error (MSE), as shown in the following equation:

$$\begin{aligned} L_{e m o} = \left\| \varvec{\epsilon }_V-\varvec{\epsilon }_{R}\right\| ^2 \end{aligned}$$
(1)

\(L_{emo}\) measures the perceptual difference between each frame of the original video and the rendered video, rather than the geometric error. Optimizing this loss during training ensures that the reconstructed 3D face conveys the emotional content of the input video.

Lip-reading consistency loss The emotion consistency loss does not retain sufficient lip shape information for the mouth area, and geometric loss using 2D landmarks cannot guarantee accurate mouth lip shape motion. Therefore, an additional lip-reading consistency loss related to the mouth area is needed to guide the network output expression and jaw pose parameters to capture the complexity of the mouth lip shape motion. The lip-reading estimation network is a pre-trained model provided by Ma et al. [28], which takes cropped grayscale images around the mouth as input sequences and outputs predicted character sequences. In the lip-reading consistency loss, only the lip-reading features extracted during the intermediate process of the lip-reading estimation network are used, rather than the predicted character sequences. Thus, the cropped sequences around the mouth of the original video and the rendered video are fed into the pre-trained lip-reading estimation network, obtaining the lip-reading features \(\varvec{\xi }_V=L(V)\) of each frame of the original video and the lip-reading features \(\varvec{\xi }_{R}=L(V_{R})\) of each frame of the rendered video. The lip-reading consistency loss uses mean squared error (MSE) to minimize the difference between the lip-reading features \(\varvec{\xi }_V\) and \(\varvec{\xi }_{R}\), as shown in the following equation:

$$\begin{aligned} L_{lip} = \left\| \varvec{\xi }_V-\varvec{\xi }_{R}\right\| ^2 \end{aligned}$$
(2)

Pose consistency loss Emotion consistency loss and lip-reading consistency loss do not involve the optimization of head pose. Therefore, an additional pose consistency loss is introduced to ensure that the network can reconstruct the head pose of the original video, especially in extreme head pose situations. Various pose estimation methods have achieved good results; for example, Qi et al. [30] proposed a hand pose estimation model consisting of automatic labeling and classification based on a deep convolutional neural network (DCNN). In our experiments, the head pose estimation network is a pre-trained model provided by Hempel et al. [20], which takes the original video and rendered video as input and produces the pose feature vectors \(\varvec{\eta }_V=P(V)\) for each frame of the original video and \(\varvec{\eta }_{R}=P(V_{R})\) for each frame of the rendered video. The pose consistency loss uses mean squared error (MSE) to measure the difference between the pose features \(\varvec{\eta }_V\) and \(\varvec{\eta }_{R}\), as shown in the following equation:

$$\begin{aligned} L_{pose} = \left\| \varvec{\eta }_V-\varvec{\eta }_{R}\right\| ^2 \end{aligned}$$
(3)
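Taken together, the three consistency losses compare perceptual features of the original and rendered frames with MSE. The sketch below illustrates this structure with arbitrary frozen stand-in networks in place of the pre-trained emotion recognition, lip-reading and pose estimation models; it is not the exact training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def consistency_losses(original, rendered, emotion_net, lip_net, pose_net,
                       mouth_crop=lambda x: x):
    """MSE between perceptual features of the original and rendered frames.

    original, rendered: (B*T, 3, H, W) batches of input-video frames and
    differentiably rendered frames of the predicted 3D face animation.
    """
    with torch.no_grad():                      # the perceptual networks stay frozen
        eps_v = emotion_net(original)
        xi_v = lip_net(mouth_crop(original))
        eta_v = pose_net(original)
    # Gradients flow only through the rendered branch.
    L_emo = F.mse_loss(emotion_net(rendered), eps_v)
    L_lip = F.mse_loss(lip_net(mouth_crop(rendered)), xi_v)
    L_pose = F.mse_loss(pose_net(rendered), eta_v)
    return L_emo, L_lip, L_pose

# Toy usage with a stand-in feature extractor (real pre-trained models replace it).
stub = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128)).eval()
x = torch.rand(4, 3, 64, 64)
y = torch.rand(4, 3, 64, 64, requires_grad=True)
losses = consistency_losses(x, y, stub, stub, stub)
```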

Geometric loss

Although the consistency losses help preserve high-level perceptual information, in some cases the model tends to create artifacts due to the domain mismatch between rendered and original images [14]. Since the consistency losses rely on pre-trained task-specific CNNs, there is no guarantee that rendered images are processed in the same way as real images. Therefore, it is necessary to guide network training with geometric constraints.

We use an L2 penalty on the difference between the expression \(\varvec{\psi }\) and head pose \(\varvec{\theta }_p\) parameters predicted by our expression and pose network and those predicted by the DECA 3D face reconstruction network.

$$\begin{aligned} L_{\psi } = \left\| \varvec{\psi }-\varvec{\psi }^{D E C A}\right\| ^2 \end{aligned}$$
(4)
$$\begin{aligned} L_{\theta _p} = \left\| \varvec{\theta }_p-\varvec{\theta }_p^{D E C A}\right\| ^2 \end{aligned}$$
(5)

The above regularization term uses the estimation of DECA as a “good” starting point. In a sense, the prediction of the expression pose network should not deviate significantly from the DECA parameters, which have been proven to produce artifact-free results in practice. In other words, this regularization scheme indirectly imposes some constraints that are hard-coded by DECA and its training process. In addition, the chin pose parameter \(\varvec{\theta }_j\) is considered as a fine-tuning of lip animation results for the speech-driven network, and thus is also constrained using the L2 norm.

$$\begin{aligned} L_{\theta _j} = \left\| \varvec{\theta }_j\right\| ^2 \end{aligned}$$
(6)

In addition to the above-mentioned regularization loss, we also apply an L1 loss between 48 facial feature points of the nose, facial contour and eyes in the 3D face model and the 2D facial feature points in the video frame.

$$\begin{aligned} L_{lmk}=\sum _{i=1}^{48} w_i\left\| k_i-s\Pi (M_i)\right\| \end{aligned}$$
(7)

In this equation, \(k_i\) represents the 2D facial feature points in the video frame, \(s\Pi (M_i)\) is the 2D facial feature points projected from 3D facial feature points, and \(w_i\) represents the weight corresponding to the feature point. For the 20 facial feature points in the mouth area, we apply a more relaxed L2 loss:

$$\begin{aligned} L_{lip\_lmk}=\sum _{i=49}^{68}\left\| k_i-s\Pi (M_i)\right\| ^2 \end{aligned}$$
(8)

To make the generated 3D facial animation smoother across frames, a velocity loss measures the difference between the frame-to-frame displacements of the 2D facial landmarks in the predicted output and in the training video. It is computed as follows:

$$\begin{aligned} L_v=\sum _{j=2}^{T} \left\| (k_j-k_{j-1})-(s \Pi (M_j)-s \Pi (M_{j-1}))\right\| ^2 \end{aligned}$$
(9)
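A minimal sketch of the geometric terms in Eqs. (4)-(9), assuming 68 landmarks per frame (48 non-mouth points followed by 20 mouth points) and per-point weights supplied as an input; the reduction (sum) is an illustrative choice.

```python
import torch

def geometric_losses(psi, theta_p, theta_j, psi_deca, theta_p_deca,
                     lmk_gt, lmk_proj, weights):
    """Regularizers and landmark losses of Eqs. (4)-(9).

    lmk_gt, lmk_proj: (T, 68, 2) ground-truth 2D landmarks and projected
    landmarks s*Pi(M_i) of the reconstructed face; weights: (48,) per-point weights.
    """
    L_psi = ((psi - psi_deca) ** 2).sum()              # Eq. (4)
    L_theta_p = ((theta_p - theta_p_deca) ** 2).sum()  # Eq. (5)
    L_theta_j = (theta_j ** 2).sum()                   # Eq. (6)

    diff = lmk_gt - lmk_proj                           # (T, 68, 2)
    # Eq. (7): weighted L1 on the 48 non-mouth landmarks.
    L_lmk = (weights * diff[:, :48].norm(dim=-1)).sum()
    # Eq. (8): L2 on the 20 mouth landmarks.
    L_lip_lmk = (diff[:, 48:].norm(dim=-1) ** 2).sum()
    # Eq. (9): velocity loss on consecutive-frame landmark displacements.
    vel = (lmk_gt[1:] - lmk_gt[:-1]) - (lmk_proj[1:] - lmk_proj[:-1])
    L_v = (vel.norm(dim=-1) ** 2).sum()
    return L_psi, L_theta_p, L_theta_j, L_lmk, L_lip_lmk, L_v
```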

Experiments

Implementation details

Our network model was implemented in PyTorch and trained on an NVIDIA GeForce RTX 3080 Ti GPU. We used the Adam optimizer with an initial learning rate of 5e-5, reduced by a factor of 5 after 50,000 iterations, a video sampling sequence length of K = 20, and a batch size of 1.
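A minimal sketch of the reported optimization setup, i.e., Adam with an initial learning rate of 5e-5 decayed by a factor of 5 after 50,000 iterations; the placeholder model and loss are illustrative only.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # placeholder for the expression pose network
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
# Divide the learning rate by 5 once, after 50,000 iterations (scheduler stepped per iteration).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50_000], gamma=1 / 5)

for iteration in range(100):          # training-loop skeleton
    optimizer.zero_grad()
    loss = model(torch.randn(1, 8)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()
```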

Dataset

Our qualitative and perceptual evaluations followed an evaluation procedure similar to SPECTRE [14]. We used the following datasets:

LRS3 [2]: The LRS3 dataset is the largest publicly available multimodal dataset for training and evaluating speech recognition, lip reading and speaker recognition models. Our approach was trained on its official training set (31,982 speaking videos) and evaluated on its official test set (1,321 speaking videos).

MEAD [37]: This dataset consists of 48 actors (28 males, 20 females) from various races who speak TIMIT [16] sentences with seven basic emotions (happiness, anger, surprise, fear, sadness, disgust, contempt) plus neutral and three different intensity levels. The entire corpus contains 31,059 sentences. Similar to the evaluation in SPECTRE, we randomly sampled 2,000 sentences to create the test set and stratified them by emotion and intensity level.

TCD-TIMIT [19]: This corpus includes 62 English-speaking actors reading 6,913 sentences from the TIMIT corpus. We evaluated our approach using its official test set.

We compared our approach with the state-of-the-art video-driven methods, SPECTRE [14] and EMOCA [11], and generated 3D face animations using their provided models. Additionally, we introduced data from some 3D facial reconstruction methods from SPECTRE, including DECA [13], 3DDFAv2 [17] and DAD3DHeads [29], which obtained 3D face animations by reconstructing each frame of the video.

Qualitative evaluation

Because the differences between predicted 3D facial expressions and lip movements and their ground truth can be affected by errors in the reconstructed 3D facial shape, evaluating them with geometric criteria may not correspond to human perception of facial expressions and mouth movements [11]. Therefore, we followed SPECTRE's qualitative evaluation protocol [14], i.e., objectively assessing the methods with lip-reading metrics by applying a pre-trained lip-reading network to the rendered output images. To eliminate bias, we did not use the lip-reading model employed in training, but evaluated with a different architecture and pre-trained lip-reading model based on the HuBERT transformer architecture, AV-HuBERT [33, 34]. We report the following metrics: character error rate (CER), word error rate (WER), viseme error rate (VER) and viseme-word error rate (VWER), where the viseme-level error rates are computed by transcribing the predicted and ground-truth text into visemes using the Amazon Polly phoneme-to-viseme mapping.
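For reference, these error rates are all computed as the edit distance between the predicted and reference sequences divided by the reference length, applied at the character, word or viseme level. The sketch below is a generic implementation of this formula, not the evaluation script used in the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)]

def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the reference length."""
    return edit_distance(ref_tokens, hyp_tokens) / max(len(ref_tokens), 1)

ref, hyp = "place blue at f two now", "place blue in f two now"
wer = error_rate(ref.split(), hyp.split())   # word error rate
cer = error_rate(list(ref), list(hyp))       # character error rate
```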

As shown in Table 1, we report the qualitative evaluation results of different methods on the LRS3 test set, TCD-TIMIT test set and MEAD test set, i.e., four metrics: CER, WER, VER and VWER. RGB represents the lip-reading prediction results from the original videos in the dataset, while DECA [13], 3DDFAv2 [17] and DAD3DHeads [29] are experimental data provided by SPECTRE [14], and SPECTRE [14] and EMOCA [11] are the results of official pretrained models tested under the same environment as ours.

Compared with the other methods, our method achieves lower CER, WER, VER and VWER on the LRS3 test set as well as on TCD-TIMIT and MEAD. However, the error rates are still much higher than those of the original videos. This is because the rendered images come from a different domain than the real images, and important features such as teeth and tongue are missing from the 3D facial model; teeth and tongue play a significant role in detecting certain phonemes or visemes (such as dental and labial consonants). Overall, our proposed method achieves the best lip-reading performance among all compared methods.

Table 1 Comparison of qualitative evaluation results for different methods

Perceptual evaluation

Fig. 4 Visualization comparison of different methods. The videos are arranged from top to bottom in the order of a original video, b EMOCA, c SPECTRE, and d our method. The red boxes highlight poor lip shapes, while the green boxes highlight good lip shapes

Qualitative evaluation mainly assesses the accuracy of lip shapes and does not evaluate the realism and lip-sync quality of the 3D facial animation. To evaluate these aspects, we conducted a user study following the perceptual evaluations of SPECTRE [14] and of speech-driven animation work [12, 32]. To prevent potential dataset bias, we evaluated our method only on videos from the MEAD and TCD-TIMIT datasets. We randomly selected 30 videos from the MEAD dataset (21 covering the 7 emotions at 3 intensities, plus 9 neutral videos) and 10 videos from the TCD-TIMIT dataset. For these 40 videos, we generated the corresponding 3D facial animations using our proposed method, SPECTRE and EMOCA.

We designed two subtasks for this user study. The first subtask compares the overall realism of the 3D facial animations: the original video, our generated 3D facial animation and the comparison method's animation are played side by side, with the positions of our method and the comparison method randomized. Participants choose the animation that is more realistic or closer to the original video, or both if they cannot tell them apart. The second subtask compares lip sync: only our animation and the comparison method's animation are played side by side, with their left-right positions randomized. Participants focus on the lip shapes and choose the animation with better lip sync, or both if they cannot tell them apart.

This user study invited 14 adult participants with normal cognitive abilities and good English proficiency, and the final results are reported in Table 2. Partial visual comparisons of different methods are also shown in Fig. 4.

Table 2 User study results on realism and lip sync

Compared with EMOCA, our approach was ranked better than or equal to it in more than 77% of the cases for realism and in 79% of the cases for lip sync. Compared with SPECTRE, our method was ranked better than or equal to it in over 70% of the cases for overall realism and 66% for lip sync. This user study indicates that our speech-video joint-driven method outperforms previous video-based methods.

Ablation study

Ablation on preprocessing and audio feature

To explore the impact of the video frame preprocessing algorithm, the speech features, and the lip animation produced by the speech-driven network on the expression pose network, we constructed six variants by enabling one or more of these components:

  (a) The baseline method: the previous video frame preprocessing method is used, i.e., each frame is cropped separately, and no audio signal is used as input.

  (b) Based on (a), we use our video frame preprocessing method, which uniformly crops consecutive frames of the video.

  (c) Based on (a), the speech features from the speech-driven network are added as input to the expression pose network and fused with the visual features.

  (d) Based on (a), the lip animation predicted by the speech-driven network is used as the basis.

  (e) Based on (a), both (c) and (d) are included, i.e., the speech features are added as input for feature fusion and the lip animation predicted by the speech-driven network is used as the basis.

  (f) Our final method, including (b), (c) and (d).

The qualitative evaluation results of these variants on the LRS3 dataset are reported in Table 3. Our video frame preprocessing, the speech features and the lip animation from the speech-driven network each contribute to improvements in expression and pose. Among them, video frame preprocessing reduces VER the most, surpassing even the combination of speech features and lip animation, and it also performs well on WER and VWER, exceeding speech features or lip animation alone.

Table 3 Ablation study on LRS3 test set
Fig. 5 Comparison of lip-animation-based approaches with and without incorporating speech features: a Raw video, b Variant d, c Variant e

In practice, as shown in Fig. 5, we found that in variant (d), adding only the speech-driven lip animation without fusing the speech features leads to unnatural results on the MEAD dataset, even though the four lip-reading metrics improve. In variant (e), fusing the speech features effectively solves this problem while further improving the lip-reading metrics. This may be because, without speech features, the expression pose network lacks sufficient information to anticipate the lip animation predicted by the speech-driven network, causing unnatural results when the two are combined. This indicates that fusing speech features improves the generalization ability of the expression pose network.

Ablation on pose consistency loss

To validate the effectiveness of the introduced pose consistency loss, we trained networks with and without it and compared the results with the EMOCA and SPECTRE networks, as shown in Fig. 6. Observing the third column, the expression pose network with the pose consistency loss correctly predicts the head pose parameters, while EMOCA, SPECTRE and the expression pose network without the pose consistency loss all predict incorrect head pose parameters. This is because these methods all rely on DECA to predict head pose parameters, and DECA cannot make correct predictions for extreme head poses. Observing the fourth column, all methods make incorrect predictions for even more extreme head poses. The main reason the expression pose network with the pose consistency loss fails here is that the head pose estimation network itself makes incorrect predictions. This reveals a limitation of the head pose consistency loss: it is bounded by the head pose estimation network, and the dataset may lack video frames with extreme head poses.

Fig. 6 Comparison of head poses generated by different methods: a Raw video, b Head pose estimation network, c EMOCA, d SPECTRE, e Without pose consistency loss, f With pose consistency loss. The red boxes indicate frames with head pose estimation issues, while the orange boxes indicate artifacts around the mouth. Our method uses mean squared error (MSE) to calculate the lip synchronization loss, which reduces the likelihood of mouth artifacts

In conclusion, by introducing the head pose consistency loss, the prediction accuracy of the expression pose network for some extreme head poses is improved compared with other methods. However, due to the limitations of the head pose estimation network and the scarcity of extreme-pose data in the dataset, accurately predicting very extreme head poses remains difficult.

Conclusions

This paper proposes a dual-modal generation method that utilizes both speech and video information to generate more natural and vivid 3D facial animations. The method builds an additional expression pose network, which takes video input and speech features, on top of a speech-driven method, and conducts experiments on publicly available 2D face datasets. Additionally, we designed a new video frame preprocessing algorithm that uniformly crops all frames within a video for consistency. To further enhance the network's ability to generate accurate expressions and poses, we introduce three consistency losses: an emotion consistency loss, a lip-reading consistency loss and a pose consistency loss. Qualitative and perceptual evaluation experiments demonstrate that our speech-video driven method outperforms existing models in generating 3D facial animations.

Currently, this paper utilizes the state-of-the-art 3D facial statistical model FLAME [24]. However, the facial model obtained through principal component analysis (PCA) fails to capture finer facial expressions. In the future, improvements can be made on the 3D facial model to enhance its nonlinearity and refine the expression manipulation capabilities. Additionally, the current 3D facial model lacks details such as teeth and hair. Exploring automated generation methods for these facial details is an area that can be further improved in the future. Integrating these improvements into the existing framework and achieving real-time performance presents a significant challenge.