MusicFace: Music-driven expressive singing face synthesis

It remains an interesting and challenging problem to synthesize a vivid and realistic singing face driven by music. In this paper, we present a method for this task with natural motions for the lips, facial expression, head pose, and eyes. Due to the coupling of mixed information for the human voice and backing music in common music audio signals, we design a decouple-and-fuse strategy to tackle the challenge. We first decompose the input music audio into a human voice stream and a backing music stream. Due to the implicit and complicated correlation between the two-stream input signals and the dynamics of the facial expressions, head motions, and eye states, we model their relationship with an attention scheme, where the effects of the two streams are fused seamlessly. Furthermore, to improve the expressiveness of the generated results, we decompose head movement generation in terms of speed and direction, and decompose eye state generation into short-term blinking and long-term eye closing, modeling them separately. We have also built a novel dataset, SingingFace, to support training and evaluation of models for this task, including future work on this topic. Extensive experiments and a user study show that our proposed method is capable of synthesizing vivid singing faces, qualitatively and quantitatively better than the prior state-of-the-art.

Existing work in the literature focuses on generating coherent facial dynamics from input speech audio [1][2][3][4][5][18][19][20][21][22]. However, in many emotional scenarios, face synthesis must be driven by composite audio that couples speech with other signals; for example, music audio contains both a human voice and background music. Therefore, in this paper, we investigate the problem of synthesizing a vivid dynamic face that is not only in sync with the input music audio but also exhibits coherent facial dynamics, as illustrated in Fig. 1. This is a non-trivial task that cannot be handled directly by existing methods. Common music audio mixes coupled human voice and background music signals, whereas most existing methods are designed to synthesize faces from human speech signals alone, so the entanglement of different audio signals leads to undesired results.
To tackle this challenge, we investigate the implicit correlation between the input signals and the facial dynamics. We treat the input music audio as a mixed signal that includes a human voice signal and a background music signal. Based on previous work [2][3][4][5][20][21][22] and our own observations, we argue that lip movement is mainly related to the voice signal (also called the speech channel), while head pose, facial expression, and eye states relate to both the voice signal and the background music signal. However, two questions arise: are these subjective observations true, and how much do the human voice and background music signals each affect the face dynamics? To answer these questions, we devise a decouple-and-fuse framework for this task. First, we separate the input music audio into a human voice channel and a background music channel. Then we dynamically fuse the two separated signals in a feature-selection fashion by introducing an Attention-based Modulator. The Attention-based Modulator modulates and balances the two signals for the downstream generators of facial expressions, head motions, and eye states.
In singing scenarios, the motions of the head and eyes are usually emotional and dramatic, which makes it harder for generators to learn such diverse and expressive motions than in talking scenarios. We propose two ingredients to improve the expressiveness of the synthesis result. For head movement, we propose to learn the rhythm of head motion decoupled from the absolute moving velocity, thus factoring out the ambiguity in the mapping between audio and head movement. For the eye states, we propose to synthesize both eye blinking and long-time eye closing states, which delivers much more expressiveness than previous methods.
Besides, to learn the complex and implicit relationship between music audio and face dynamics, we build the SingingFace dataset from our own recordings. The dataset contains over 600 singing videos with synchronous music audio. To the best of our knowledge, this is the first dataset pairing face dynamics with music audio. We believe it will promote future research on this topic.
In summary, the contributions of this paper are as follows:
• This is the first framework for synthesizing a singing face video driven by input music audio mixed from human voice and background music signals. In the framework, we introduce the Attention-based Modulator to balance the effects of the two signals on head movements, expressions, and eye states.
• We propose to synthesize the speed and direction of head movements separately, instead of predicting head pose directly. This simple yet effective modification leads to more consistent head dynamics in line with the music rhythm. Besides, we propose to decompose the eye states into eye blinking and long-time eye closing, which is much more realistic in singing scenarios.
• We build the first dataset that contains expressive singing face videos with synchronous music audio, and make it public to facilitate future research on this topic.

Audio-driven Talking Face Synthesis
Audio-driven face synthesis has been widely explored.
Previous work [1,19,23,24,25] focuses on establishing the mapping between facial motion factors and audio features. Brand [23] uses a Hidden Markov Model (HMM) to predict facial motions. Ezzat et al. [24] leverage an example-based method that maps phonemes to mouth shape and texture parameters in the Principal Component Analysis (PCA) space. Wang et al. [25] model a mapping between Mel-Frequency Cepstral Coefficients (MFCC) and PCA model parameters via an HMM approach. Benefiting from deep learning techniques, more recent methods generate more diverse faces in sync with the input audio. Shimba et al. [19] estimate active appearance model (AAM) parameters with a Long Short-Term Memory (LSTM) network. Cudeiro et al. [1] employ convolutions to encode speech and decode facial attributes to animate a 3D template.

Music-driven Animation
Music-driven human pose animation has been studied for decades. Early work [38][39][40] formulates the task as a template matching problem. Lee et al. [39] and Shiratori et al. [40] generate dance motion sequences with musical similarity based on manually defined audio features, while Cardle et al. [38] edit motions guided by musical features. Because of their limited capacity, these template-matching approaches cannot generate diverse and natural dance motions.
It is interesting to note that music-driven singing face synthesis remains a rarely studied open problem. To the best of our knowledge, Song2Face [53] is the only method designed for singing scenarios so far. However, it operates on plain human singing voice and works well only without the disturbance of background music.
Synthesizing expressive singing faces from mixed music signals is more challenging in three respects. First, the singing voice and background music are entangled, making it difficult for models to extract phoneme-related information and leading to inaccurate lip motions. In addition, the relative contributions of the different driving sources change over time and are even interconnected with each other. Finally, the model must handle multiple downstream generation tasks at the same time for the result to look natural and realistic overall. To solve these issues, this paper proposes a decouple-and-fuse framework that can generate realistic and rhythmical facial dynamics from a mixed music wave. We therefore expect this work to open new research directions in music-guided person synthesis.

Problem Definition
In previous research [4,16,26,34,54,55], given a piece of speech audio $A$ and a short reference video (or a single face image) $V$, the ultimate goal is to generate a realistic talking face video $S$ synchronized with the input audio $A$, which can be represented as:

$$F_{exp}, F_{pose}, F_{eye} = G(E(A)), \qquad S = R(F_{exp}, F_{pose}, F_{eye}, V),$$

where $F_{exp}$, $F_{pose}$, $F_{eye}$ denote the facial expression, head pose, and eye state parameters synthesized by a generator $G$, respectively. $E$ refers to an audio feature extractor and $R$ denotes a rendering network synthesizing photo-realistic images. However, directly predicting driving parameters from audio is not adequate for music scenarios because of the complicated mutual influences between the human voice, which carries lyric information, and the background music, which carries melody information. We propose a decouple-and-fuse strategy to tackle this problem. It first adopts an audio source separation model $O$ to decompose music into human voice $A_v$ and background music $A_b$, and then obtains the encoded lyric feature $L$ and melody feature $M$ using an attention-assisted two-stream encoder $E$. The encoder encodes lyric and melody separately, and modifies the relative contribution of the two encoded features to the generation process through an attention mechanism. Finally, a generator $G$ is employed to generate the driving parameters of a singing face video $S$ from the decoupled lyric and melody features. The full pipeline can be formulated as follows:

$$A_v, A_b = O(A), \qquad L, M = E(A_v, A_b), \qquad F_{exp}, F_{pose}, F_{eye} = G(L, M), \qquad S = R(F_{exp}, F_{pose}, F_{eye}, V).$$

As illustrated in Fig. 2, our overall framework contains three components: 1) a driving parameter generator that translates music audio to facial expression, head pose, and eye states, 2) a reference module that extracts fixed parameters such as face identity from a given human face, and 3) a renderer that synthesizes photo-realistic frames conditioned on the above parameters. We employ a conditional-GAN-based method as our renderer, with the same architecture as [34]. To enhance the expressiveness of singing faces, the generator $G$ is designed as the encoder-decoder architecture shown in Fig. 3. The Encoder (Sec. 3.2) consists of a Two-stream Audio Encoder (TSAE) that encodes lyric and melody separately and an Attention-based Modulator (ATM) that balances the contribution of the different audio features. The Decoder (Sec. 3.3) contains three downstream generators: the Expression Generation Network (EGN) for facial expression parameters, the Pose Generation Network (PGN) for head pose dynamics, and the Eye State Generation Network (ESGN) for eye state parameters. In the next subsections, we introduce these five essential parts in turn and provide the corresponding learning objective and training strategy.

Encoder
As mentioned above, the lyric and melody entangled in the original music wave have a complicated relationship in guiding the generation of human face dynamics, making it difficult for the generation network to synthesize vivid face dynamics directly from plain music features. To tackle the problem, we employ a decouple-and-fuse strategy. Specifically, using the state-of-the-art audio source separation model Spleeter [56], we decompose the original music into human voice and background music. Then we encode lyric from the human voice and melody from the background music separately using a two-stream audio encoder. Finally, we adjust them with attention-based modulators to distribute the relative contribution of lyric and melody for each specific generation task.

Audio Feature Extraction
Taking the separated audio wave (human voice or background music) of $T$ seconds sampled at 16 kHz as input, we extract mel-frequency cepstral coefficients (MFCC) and their first derivatives with a 25 ms window size and a 10 ms window step, resulting in 26-D audio features at 100 frames per second. Furthermore, in order to incorporate temporal information and match the frequency of the video frames (30 fps), the feature sequence is converted to overlapping windows of size 39 (corresponding to 390 ms) at 30 fps. Therefore, the output feature is a three-dimensional array of size $(30 \times T, 39, 26)$.
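The re-windowing step above can be illustrated with a minimal sketch in plain Python. It assumes the 26-D features at 100 frames per second are already computed; the function name `window_features`, the centring of each window on the video-frame time, and the zero padding at the sequence boundaries are assumptions made for illustration and are not specified in the text.

```python
def window_features(frames, fps_in=100, fps_out=30, win=39):
    """Convert per-frame features (N x D at fps_in) into overlapping
    windows (N_out x win x D) sampled at fps_out.

    Assumptions (not from the paper): windows are centred on the
    video-frame time and zero-padded at the sequence boundaries.
    """
    n = len(frames)
    dim = len(frames[0])
    n_out = int(round(n / fps_in * fps_out))  # 30 windows per second of audio
    half = win // 2                           # 19 frames on each side of the centre
    pad = [0.0] * dim
    windows = []
    for k in range(n_out):
        centre = int(round(k * fps_in / fps_out))
        window = [frames[j] if 0 <= j < n else pad
                  for j in range(centre - half, centre + half + 1)]
        windows.append(window)
    return windows


T = 2                                             # seconds of audio
mfcc = [[0.0] * 26 for _ in range(100 * T)]       # placeholder 26-D features at 100 fps
out = window_features(mfcc)
shape = (len(out), len(out[0]), len(out[0][0]))   # (30 * T, 39, 26)
```

For a clip of $T = 2$ seconds the resulting shape is $(60, 39, 26)$, which agrees with the stated $(30 \times T, 39, 26)$.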

Two-stream Audio Encoder (TSAE)
Given the separated human voice feature $A_v$ and background music feature $A_b$, we adopt a Two-stream Audio Encoder (TSAE) that consists of two networks $AE_v$ and $AE_b$ to encode the MFCC features of human voice $a^v_t$ and background music $a^b_t$ separately: $f^v_t = AE_v(a^v_t)$, $f^b_t = AE_b(a^b_t)$, where $AE_v$ and $AE_b$ are 1D temporal convolutional neural networks with residual blocks sharing the same network structure, and

Attention-based Modulator (ATM)
For a specific downstream generation task, the relative contributions of features representing different semantic information change over time and are even interconnected with each other. For example, imagine the head pose dynamics of a person singing a line of a song. He prepares to vocalize, then sings, and finally shuts his mouth. In the first and third stages, he rotates his head rhythmically, dominated by the melody. But when he vocalizes, the melody in the background music and the lyric in the human voice influence his head movements together. So the dominant source changes over time and even becomes ambiguous during vocalization, making the generation task difficult.
Therefore, in order to generate vivid human face movements, we introduce a channel attention mechanism similar to the one proposed in [57] to determine the relative contribution of lyric and melody to the generation result. The only difference is that, to account for the long-time dependence between the audio features of different time steps, we use a temporal U-net to generate the attention weights instead of a simple multi-layer perceptron (MLP) network. Specifically, given the separately encoded audio features, we employ an Attention-based Modulator (ATM) for each generation task to estimate an attention weight for each feature map in the embedding features $f_v$ and $f_b$ and adjust the relative importance between them:

$$l, m = \mathrm{ATM}(f_v, f_b) = \sigma(\text{U-net}(f_v \oplus f_b)) \odot (f_v \oplus f_b),$$

where $l$ and $m$ denote the final output embeddings of the lyric and melody features for the full audio sequence, respectively, $\oplus$ represents concatenation along the feature channel dimension, and $\odot$ indicates the element-wise product. ATM indicates the Attention-based Modulator implemented using a temporal U-net network U-net, and $\sigma$ represents the sigmoid activation function.
As shown in Fig. 3, we employ one ATM to learn the optimal attention weights for each downstream task. Specifically, we apply a total of three ATMs on $f_v$ and $f_b$, to get $l_{exp}$ and $m_{exp}$ for the expression generation task, $l_{pose}$ and $m_{pose}$ for the head pose generation task, and $l_{eye}$ and $m_{eye}$ for the eye state generation task, respectively.
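To make the gating idea concrete, the following is a minimal sketch of sigmoid-gated channel attention over the concatenated two-stream features. It is not the ATM itself: the temporal U-net is replaced by a hypothetical stand-in, `temporal_mean_scores`, that simply averages each channel over a short time window, and all function names are assumptions introduced for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def temporal_mean_scores(feats, radius=2):
    """Hypothetical stand-in for the temporal U-net: each channel score is
    the mean of that channel over a local window of time steps."""
    n_steps, n_ch = len(feats), len(feats[0])
    scores = []
    for t in range(n_steps):
        lo, hi = max(0, t - radius), min(n_steps, t + radius + 1)
        scores.append([sum(feats[k][c] for k in range(lo, hi)) / (hi - lo)
                       for c in range(n_ch)])
    return scores


def attention_modulate(f_v, f_b, score_fn=temporal_mean_scores):
    """Gate voice features f_v and music features f_b (time x channels).

    Returns the modulated lyric features l, melody features m, and the
    per-channel attention weights in (0, 1).
    """
    n_voice = len(f_v[0])
    concat = [v + b for v, b in zip(f_v, f_b)]         # channel-wise concatenation
    weights = [[sigmoid(s) for s in row] for row in score_fn(concat)]
    gated = [[w * x for w, x in zip(w_row, x_row)]     # element-wise product
             for w_row, x_row in zip(weights, concat)]
    l = [row[:n_voice] for row in gated]
    m = [row[n_voice:] for row in gated]
    return l, m, weights


f_v = [[1.0, 2.0] for _ in range(3)]   # toy voice embedding: 3 steps, 2 channels
f_b = [[0.0] for _ in range(3)]        # toy music embedding: 3 steps, 1 channel
l, m, w = attention_modulate(f_v, f_b)
```

In this sketch one such modulator would be instantiated per downstream task, so that expression, pose, and eye state generation each receive their own weighting of the two streams.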

Subject Style Embedding
Our TSAE, EGN, PGN, and ESGN are conditioned on a subject code to learn subject-specific styles, adopting a strategy similar to [1], which encodes each subject in the dataset using a one-hot subject encoding. At the training stage, the subject encoding is concatenated to each input MFCC feature $a^v_t$ and $a^b_t$, and also to the final outputs $l_t$ and $m_t$ of the ATM.

Expression Generation Network
We employ a simple MLP consisting of two fully connected layers and one ReLU activation layer to regress facial expression (including lip motion) parameters from the encoded lyric and melody features. The process can be formulated as: where $f_t$ denotes the predicted facial expression parameter at time step $t$ and $\phi_{exp}$ denotes the MLP for expression generation.

Pose Generation Network
Traditional audio-driven pose generation methods directly regress head pose parameter sequences from audio features [3,4,34]. This ignores the fact that, given a fixed audio sequence, different people, or even the same person singing the same song multiple times, can produce quite different head pose sequences, as shown in Fig. 4. We find that although the dynamics of the head pose vary when the same person sings the same song multiple times, the speed of the head pose stays similar and in line with the rhythm of the music. Motivated by this, we propose to generate the moving speed and moving direction of the head separately, and combine them to generate the head pose $p \in \mathbb{R}^6$, which includes Euler angles and a 3D translation vector, at each time step.
Moving Speed Generation: In the first stage, we use an MLP network $\phi_{speed}$ to predict the speed of the head pose parameters from the encoded audio features at the current time $t$: where $l_{pose}$ and $m_{pose}$ are the lyric and melody embedding features for pose generation, and $\hat{s}_t \in \mathbb{R}^6$ is the output head speed at time step $t$. As $\hat{s}_t$ cannot be negative, we apply the absolute value function ABS to the output of $\phi_{speed}$.

Moving Direction Generation: We use an LSTM network followed by a fully connected layer $\phi_{direc}$ to generate the direction of head movements from the encoded audio features concatenated with the previous head pose and moving velocity at the last time step: where $p_{t-1}, p_{t-2} \in \mathbb{R}^6$ indicate the generated head pose parameters represented by Euler angles and 3D translation parameters, $v_{t-1} \in \mathbb{R}^6$ is the predicted head pose velocity, $c_{t-1}$ and $c_t$ are the cell states, $o_t$ is the output of the LSTM network, and $d_t \in \mathbb{R}^6$ is the predicted moving direction, each at the corresponding time step.
Head Pose Generation: Finally, the pose $p_t$ at time step $t$ can be directly calculated by: $p_t = p_{t-1} + \hat{s}_t \times d_t$.
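The pose update can be sketched as a simple accumulation over time. The example below assumes that the product of speed and direction is taken element-wise over the six pose dimensions and that the speeds and directions have already been predicted; the function name `integrate_head_pose` and the toy values are illustrative assumptions, not the networks described above.

```python
def integrate_head_pose(p0, speeds, directions):
    """Accumulate 6-D head poses from per-step speeds and directions.

    p0         : initial pose (3 Euler angles + 3 translations)
    speeds     : per-step raw speed predictions; abs() mirrors the ABS
                 applied to the speed network output
    directions : per-step direction predictions
    Assumption: the product of speed and direction is element-wise.
    """
    poses = []
    prev = list(p0)
    for s_t, d_t in zip(speeds, directions):
        step = [abs(s) * d for s, d in zip(s_t, d_t)]
        prev = [p + dp for p, dp in zip(prev, step)]
        poses.append(prev)
    return poses


p0 = [0.0] * 6
speeds = [[1.0, -2.0, 0.0, 0.0, 0.0, 0.0],
          [0.5, 1.0, 0.0, 0.0, 0.0, 0.0]]
directions = [[1.0, -1.0, 0.0, 0.0, 0.0, 0.0],
              [-1.0, -1.0, 0.0, 0.0, 0.0, 0.0]]
poses = integrate_head_pose(p0, speeds, directions)
```

Because the speed term only scales the step, two performances with the same rhythm but different directions trace different pose curves while sharing the same speed profile, which is the ambiguity the decomposition is meant to factor out.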

Eye State Generation Network
Traditional methods usually generate only random eye blinks from audio features [34] or noise inputs [54], ignoring the long-time eye closing that occurs in singing scenarios, e.g., people may close their eyes for a long time while singing the climax of a song. We decompose the generation of eye states into random eye blinking generation and long-time eye closing state generation. Human blinks occur randomly and can be sampled from empirically predefined random distributions, whereas the long-time eye closing state should be learned from data.
Random Eye Blinking Generation: Normal human blinks show regularity in terms of the average eye blinking rate and the average inter-blink duration [34]. Accordingly, we uniformly sample the blink interval $B_i$ and the blink duration $B_d$. Then we generate the eye state of the blink dynamics $\hat{e}^{blink} \in \{0, 1\}$ according to $B_i$ and $B_d$.

Long-time Eye Closing State Generation: We employ an MLP network $\phi_{eye}$ to generate the eye state $\hat{e}^{long}_t$ at time step $t$: We combine $\hat{e}^{blink}_t$ and $\hat{e}^{long}_t$ to get the composite dynamics of the eye states $\hat{e}_t$: Finally, we apply a temporal Gaussian filter on $\hat{e}_t$ to obtain smoother eye state dynamics.
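The eye state pipeline can be sketched end to end under explicit assumptions. In the example below, 1 is taken to mean "closed", the interval and duration ranges are placeholder values rather than the ones used in the paper, the two states are combined with an element-wise maximum, and the Gaussian filter is normalised at the boundaries. All function names are hypothetical.

```python
import math
import random


def sample_blinks(n_frames, fps=30, interval_range=(2.0, 4.0),
                  duration_range=(0.1, 0.3), seed=0):
    """Binary blink track (1 = closed). Ranges are placeholder assumptions."""
    rng = random.Random(seed)
    state = [0] * n_frames
    t = rng.uniform(*interval_range)               # time of the first blink
    while int(t * fps) < n_frames:
        start = int(t * fps)
        length = max(1, int(round(rng.uniform(*duration_range) * fps)))
        for k in range(start, min(n_frames, start + length)):
            state[k] = 1
        t += length / fps + rng.uniform(*interval_range)
    return state


def combine_eye_states(blink, long_close):
    """Assumed combination rule: the eye is closed if either track says so."""
    return [max(b, c) for b, c in zip(blink, long_close)]


def gaussian_smooth(x, sigma=1.0, radius=3):
    """Temporal Gaussian filter with renormalisation at the boundaries."""
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    out = []
    for i in range(len(x)):
        num = den = 0.0
        for k, w in zip(offsets, kernel):
            j = i + k
            if 0 <= j < len(x):
                num += w * x[j]
                den += w
        out.append(num / den)
    return out


blink = sample_blinks(300)                         # 10 s at 30 fps
long_close = [1 if 150 <= i < 210 else 0 for i in range(300)]
eye = gaussian_smooth(combine_eye_states(blink, long_close))
```

In the method itself the long-time closing track would come from the learned network rather than a fixed interval as in this toy example.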

Learning Objective
We supervise our generator with the following loss functions: where $L_{exp}$, $L_{pose}$, and $L_{eye}$ are the losses for facial expression, head pose, and eye states, respectively. $L_{att}$ is the loss term that pushes the ATM to select useful feature channels. Each loss term is formulated as: where the $\|\cdot\|_2^2$ term is the velocity loss term, and $L_{MMD}$ [58] is the maximum mean discrepancy loss, which matches all orders of statistics between the prediction and the ground truth. Here we use $x$ to represent the ground truth and $\hat{x}$ for the predicted values. In our experiments, we empirically set $w_1 = 5$, $w_2 = 50$, $w_4 = 10$, $w_5 = 5$, and set the other weights to 1.0.
Furthermore, in order to improve the diversity of generation results, we use an adversarial loss to fool the discriminator $D$, which is defined as: The total loss function in training phase is:

Experiments

Implementation Details
Our method is implemented with PyTorch, and all the experiments are conducted on two NVIDIA RTX 3090 GPUs.
For network training, we randomly sample frame sequences with a sliding window of 128 frames. We adopt the Adam optimizer with a learning rate of 0.0001 and train for 50 epochs. Linear learning rate decay is applied over the last 60% of the epochs. The hyperparameters in Eq. (13) are $\lambda_1 = 1$ and $\lambda_2 = 0.1$, respectively. To obtain vivid and photo-realistic visualization results, we train a rendering-to-video network following FACIAL [34].
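As an illustration of the schedule described above, the sketch below computes the learning rate per epoch. The text does not state the final value of the decay, so decaying linearly toward zero at the end of training is an assumption, as are the function name `learning_rate` and the zero-based epoch index.

```python
def learning_rate(epoch, base_lr=1e-4, total_epochs=50, decay_fraction=0.6):
    """Constant learning rate followed by linear decay over the last
    `decay_fraction` of the epochs (epochs are zero-indexed).

    Assumption: the rate decays linearly toward zero at the end of training.
    """
    decay_start = int(round(total_epochs * (1.0 - decay_fraction)))  # epoch 20
    if epoch < decay_start:
        return base_lr
    return base_lr * (total_epochs - epoch) / (total_epochs - decay_start)


schedule = [learning_rate(e) for e in range(50)]
```

With these settings the rate stays at 0.0001 for the first 20 epochs and is halved by epoch 35. In PyTorch, a function of this form could be wrapped in a scheduler that multiplies the base rate by a per-epoch factor.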

Dataset Organization
As mentioned above, popular conventional datasets only contain talking face videos that lack expressiveness. To overcome this, we build a new dataset called SingingFace. SingingFace includes more than 600 singing videos of 6 human subjects. Our supplementary video shows the learned styles of the different subjects when training across all 6 subjects.
Video Collection: We build our dataset by recording singing videos ourselves. Specifically, we first collect the set of singing audio, and then record the face region of the person singing each song while the music is played simultaneously. Finally, we automatically align each video to the corresponding music audio using SyncNet [59] to ensure audio-visual synchronization.
Audio Separation: We use the state-of-the-art audio source separation model Spleeter [56] to extract the human voice as lyric information and the background music as melody information.
3D Face Reconstruction: To automatically extract face expression parameters and head poses from a singing video, we adopt Deep3DFace [60] to extract face parameters $[\alpha, \beta, \delta, \gamma, p]$, where $\alpha \in \mathbb{R}^{80}$, $\beta \in \mathbb{R}^{64}$, and $\delta \in \mathbb{R}^{80}$ are the coefficient vectors for geometry, expression, and texture, respectively. $\gamma \in \mathbb{R}^{27}$ is the vector of spherical harmonics (SH) illumination coefficients. The 3D face pose $p = [R; t]$ is represented by a rotation $R \in SO(3)$ and a translation $t \in \mathbb{R}^3$. The PCA bases of geometry, texture, and expression are adopted from the Basel Face Model [61] and FaceWareHouse [62].
Eye State Extraction: We employ the state-of-the-art facial analysis system OpenFace [63] to extract action unit AU45r as the eye blink parameter. Because we observed that the distribution of the extracted AU45r values varies considerably across people, we apply min-max normalization to AU45r for each video individually. Then we apply a time length threshold $\tau = 0.5$ s to separate short-time blinking from long-time eye closing states.
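The extraction step can be sketched as per-video min-max normalization followed by run-length segmentation. The duration threshold $\tau = 0.5$ s comes from the text; the level `closed_level` above which a normalized value counts as "closed", as well as the function names, are assumptions made for this example.

```python
def min_max_normalize(values):
    """Scale one video's values to [0, 1]; a constant track maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def split_eye_states(au45, fps=30, tau=0.5, closed_level=0.5):
    """Split a per-frame AU45 track into blink and long-closing labels.

    Runs of closed frames longer than tau seconds are labelled long-time
    closing; shorter runs are labelled blinks.
    Assumption: a frame is closed when its normalized value exceeds
    closed_level (this level is not given in the text).
    """
    closed = [v > closed_level for v in min_max_normalize(au45)]
    blink = [0] * len(closed)
    long_close = [0] * len(closed)
    i = 0
    while i < len(closed):
        if not closed[i]:
            i += 1
            continue
        j = i
        while j < len(closed) and closed[j]:
            j += 1                                   # j is one past the end of the run
        target = long_close if (j - i) / fps > tau else blink
        for k in range(i, j):
            target[k] = 1
        i = j
    return blink, long_close


# Toy track: a 3-frame blink (0.1 s) and a 30-frame closure (1.0 s) at 30 fps.
au45 = [0.0] * 10 + [5.0] * 3 + [0.0] * 10 + [5.0] * 30 + [0.0] * 5
blink, long_close = split_eye_states(au45)
```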
Data Statistics: We collect over 600 Chinese and English singing videos totaling 40 hours at 30 FPS. Each video contains one person singing a whole song, and the average video length is about 4 minutes. Each video has a stable camera location and appropriate lighting conditions. We

Ablation Study
To verify the effectiveness of the key ingredients in our proposed method, i.e., 1) the audio separation step and two-stream audio encoder (TSAE), 2) the attention modulator (ATM), and 3) the head pose generation network (PGN), we study the following variants of our method:
• Single-Stream: no audio source separation; a single-stream audio encoder encodes the MFCC features of the mixed audio wave; no ATM; PGN replaced with an MLP network.
• Two-Stream: equipped with audio source separation and TSAE; no ATM; PGN replaced with an MLP network.
• With-ATM: equipped with audio source separation, TSAE, and ATM; PGN replaced with an MLP network.
• Ours: equipped with audio source separation, TSAE, ATM, and our proposed PGN.
We compare the above variants on the held-out test split of SingingFace. We evaluate the Audio-Visual Confidence (AVC) score proposed in [59] and the Landmark Distance (LMD) introduced in [64] for lip synchronization comparison. However, to the best of our knowledge, there are currently no established metrics for evaluating the realism of generated head pose and eye closing dynamics, which is a subjective judgment and remains an open question. Following Zhang et al. [65], we apply Canonical Correlation Analysis (CCA) to the generated head pose parameter sequences and eye state sequences together with the ground truth, and use the canonical correlation as the evaluation metric for perceptual realism. Note that, to emphasize the rhythm of the head pose dynamics, which should be in line with the music, we apply CCA to the moving speed of the generated head pose sequences instead of the head pose parameters themselves. We also compute the second-derivative-based roughness (Rough) of the generated Euler angles, defined in Eq. (14), to evaluate head motion smoothness: where $R''(t)$ denotes the second derivatives of the head rotation angles at time step $t$. The quantitative results of the ablation study are summarized in Tab. 1.
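A second-derivative-based roughness measure can be sketched with finite differences. Since Eq. (14) is not reproduced in the text, the aggregation used here, the mean absolute second difference over all frames and rotation axes, is an assumption, and the function names are hypothetical.

```python
def second_differences(seq):
    """Discrete second derivative of a 1-D sequence (unit frame spacing)."""
    return [seq[t + 1] - 2.0 * seq[t] + seq[t - 1]
            for t in range(1, len(seq) - 1)]


def roughness(angles):
    """Mean absolute second difference of per-frame Euler angles.

    angles: list of [yaw, pitch, roll] per frame.
    Assumption: averaging over frames and axes stands in for the exact
    aggregation of Eq. (14), which is not given in the text.
    """
    total, count = 0.0, 0
    for axis in range(len(angles[0])):
        for d2 in second_differences([frame[axis] for frame in angles]):
            total += abs(d2)
            count += 1
    return total / count


smooth = [[float(t), 2.0 * t, 0.0] for t in range(5)]          # constant velocity
jittery = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
           [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]                   # zigzag on one axis
r_smooth = roughness(smooth)
r_jittery = roughness(jittery)
```

Constant-velocity motion yields zero roughness under this measure, while frame-to-frame jitter raises it, so lower values indicate smoother head motion.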

Effectiveness of Two-Stream Design: As mentioned above, lyric and melody information are entangled in plain music waves, making it difficult to learn facial dynamics in line with the music. By separating the human voice and background music from plain music waves and encoding the features separately, our two-stream design greatly reduces the complexity of the lip synchronization task and thus leads to better synchronization. As shown in Fig. 5, if we learn singing facial dynamics directly from plain music (Single-Stream), the generated mouth movements are severely disrupted by the background music (e.g., the mouth stays open during silence). In contrast, the other variants, which apply our two-stream design, behave correctly. Moreover, after applying source separation and our TSAE (Two-Stream), all of the evaluation metrics improve considerably over Single-Stream, as shown in Tab. 1. This improvement can be seen more clearly in our supplementary video.
Effectiveness of Attention-based Modulator: Our Attention-based Modulator automatically assigns attention weights to the different features at each time step for each downstream generation task. It allows our model to extract as much useful information as possible from the entangled audio features for each specific downstream task and to suppress interfering sources. This is supported by the experimental results: the With-ATM variant outperforms Two-Stream on all of the evaluation metrics summarized in Tab. 1.
Effectiveness of Pose Generation Network: Compared with simple MLP networks, the head pose dynamics generated by our PGN show superior perceptual results. The improvement comes from the fact that our PGN decomposes head pose sequence generation into moving speed generation and moving direction generation. First, the network is able to concentrate on generating the moving speed, which is more closely related to the rhythm of the music, resulting in more rhythmical head pose dynamics in line with the music. This is supported by Tab. 1, where our method outperforms the others by a large margin on the CCA metric of head pose. Second, the LSTM module for moving direction generation considers not only the current audio features but also the generated head moving history, resulting in smoother and more spontaneous head moving curves. As shown in Fig. 5, the pose rotation curves generated by our method (Ours) are smooth and closely resemble the ground truth. Specifically, the turn of the dominant varying angle (shown as the green curve) of our generated head occurs at nearly the same time as in the ground truth. We recommend checking our supplementary video to see the difference more clearly.

Analysis of Attention Weight: To further investigate the role of the Attention Modulator (ATM), we visualize the predicted attention weights for the synthesis tasks of facial expression, head movement, and eye state. As shown in the case illustrated in Fig. 6, it can be observed that:
• When there is background music only and no human voice, the ATM places more emphasis on the background music stream (melody feature), as shown in the earlier part of the music (see Region I).
• When there is both background music and human voice, in the tasks of face expression and head pose, the ATM modulates the weights between the two streams to pursue more expressive results (see Region II). Numerically, the weights of the human voice are higher than those of the background music. In this case, the human voice dominates the generation of face expression and head pose.
• When there is both background music and human voice, both the human voice and the background music affect the long-time eye closing state, either simultaneously (see Region III) or separately (see Regions IV and V).

Compared State-of-the-art Methods
Most previous state-of-the-art methods are designed for talking scenarios and trained on talking datasets such as Voxceleb2 [67] and LRS2 [68]. For a fair comparison, we select the methods whose training code is publicly available and retrain them on our SingingFace dataset. The selected state-of-the-art methods are as follows:
• ATVG [26] is a 2D-based cascade GAN approach that generates a talking face video robust to different facial characteristics, taking an audio sequence and a target image as input.
• Yi et al. [4] utilize 3D face model information to synthesize photo-realistic talking face videos with personalized pose dynamics.
• LiveSpeechPortraits (LSP) [66] presents a live system that utilizes 2D landmarks to generate personalized photo-realistic talking-head animation in real time.
• FACIAL [34] integrates implicit attribute learning to synthesize 3D face animation with realistic motions of lips, head poses, and eye blinks.
We also report qualitative comparison results with Song2Face [53], which, to the best of our knowledge, is the only method designed for singing scenarios so far. Song2Face maps each human voice segment to facial expression and head rotation parameters, and uses an adaptive filter network to incorporate information from neighboring frames for temporal stability. It should be noted that Song2Face models only a single driving source (plain human singing voice) as input, while ours supports multiple driving sources, e.g., human voice and background music, and focuses on how to coordinate the different driving sources to generate more realistic head movements. In addition, Song2Face treats eye states as part of the facial expression, while ours decomposes the generation of eye states into an individual generation task. Since the implementation of Song2Face is unavailable, a quantitative evaluation against Song2Face is absent from this paper. We recommend viewing our supplementary video for a better comparison.

Qualitative Comparison
Fig. 7 and Fig. 8 show the visual comparison with other state-of-the-art methods. We summarize the qualitative comparison results in this section.

Realism on Pose Dynamics: As shown in the supplementary material, ATVG [26] only generates talking face videos with a fully static head pose, which contradicts common sense. Yi et al. [4] generate photo-realistic videos, but the talking faces usually show only subtle movements due to the supervision pipeline. In addition, the generated head pose dynamics are discontinuous because of the background matching trick used in [4], which matches short-term generated head poses to the same target frame when target frames are scarce in the target video. LiveSpeechPortraits [66] generates smooth but relatively small head movements. The generated head pose also shows a weak correlation with the rhythm of the music. FACIAL [34] and Song2Face [53] can generate more natural head pose dynamics than the other compared state-of-the-art methods, but they still show only a few variations in head movement patterns over a long period of time. Our method generates the most realistic head poses owing to our pose generation method. For example, the head rotates quickly and dramatically during dense syllables, and slowly while long syllables are pronounced. Readers are encouraged to watch the supplementary video to see the visual results more clearly.

Fig. 8 Visual comparison with Song2Face [53]. Our method can generate photo-realistic frames, diverse head poses, and natural eye closing dynamics, which is infeasible for Song2Face [53].
Realism on Eye States: The compared methods differ in how they generate eye states. ATVG and Yi et al. do not generate eye state parameters, and therefore do not produce any eye closing dynamics. Song2Face and FACIAL learn random blink dynamics from data. However, Song2Face only performs well on plain human singing voice (no background music), and FACIAL only generates open eyes during inference, failing to generate spontaneous eye closing dynamics because of the complex entanglement between short random blinks and long-time eye closing states in the SingingFace dataset. LiveSpeechPortraits directly samples random blink dynamics from the target video, and our method synthesizes random blinks from predefined random distributions; both show realistic random blink results. Moreover, as shown in Fig. 9, our method can also generate long-time eye closing dynamics (>0.5 s) during singing based on the rhythm and emotion in the music, which further enhances the sense of realism.

Quantitative Evaluation
We use the same test set of music as in the ablation study to compare our method with state-of-the-art counterparts. To clarify the effectiveness of the audio source separation model used in our method, we report the comparison results on both mixed signals and separated signals (human voice and background music). Our method achieves superior results on most metrics in both cases. The results are summarized in Tab. 2.
Lip-sync metric: As in the ablation study, we evaluate the Landmark Distance Metric [64] and the Audio-Visual Confidence score [59] to compare the lip synchronization of our method with that of the state-of-the-art methods. Tab. 2 shows that in both mixed and separated scenarios, our method outperforms all counterparts. It also shows that it is beneficial to separate the human voice from the plain music wave for mouth movement generation. Note that the separated singing voice is given as input to the pre-trained lip-sync evaluation model during evaluation to ensure that the pre-trained model performs correctly.
Pose Realism: To evaluate the realism of the pose dynamics of different methods, we measure the Canonical Correlation between the predicted pose parameter sequences and the ground truth, following [65]. Similarly, to emphasize the evaluation of the rhythm of the synthesized head pose sequences, we apply Canonical Correlation Analysis to the speed of the head movement. Tab. 2 shows that our method generates more realistic and rhythmic pose dynamics.
Eye Realism: We measure the Canonical Correlation between the predicted eye state parameter sequences and the ground truth, following [65], to evaluate the realism of long-time eye closing dynamics. For random blinking evaluation, we count the average blinking frequency (blinks/s) and intra-blink duration (s) of the generated singing face videos, and compare them with the ground truth. As shown in Tab. 2, these two statistics of our method are similar to the ground truth, falling within a reasonable range.
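The two blink statistics above can be illustrated with a minimal sketch. It assumes the eye state is available as one boolean per frame (True for closed), reads "intra-blink duration" as the length of a single blink, and treats closed runs longer than 0.5 s as long-time eye closing rather than blinks, following the threshold quoted earlier. The function name `blink_statistics` and its parameters are hypothetical and are not taken from the paper's implementation.

```python
from typing import List, Tuple


def blink_statistics(closed: List[bool], fps: float,
                     max_blink_s: float = 0.5) -> Tuple[float, float]:
    """Return (blinks per second, mean blink duration in seconds).

    A blink is a maximal run of closed-eye frames lasting at most
    max_blink_s seconds; longer runs are counted as long-time eye
    closing and are excluded from both statistics (an assumption).
    """
    durations = []
    run = 0
    # The trailing False acts as a sentinel that flushes a run ending at the last frame.
    for is_closed in list(closed) + [False]:
        if is_closed:
            run += 1
        elif run > 0:
            seconds = run / fps
            if seconds <= max_blink_s:
                durations.append(seconds)
            run = 0
    total_seconds = len(closed) / fps
    frequency = len(durations) / total_seconds if total_seconds > 0 else 0.0
    mean_duration = sum(durations) / len(durations) if durations else 0.0
    return frequency, mean_duration
```

Applying the same function to generated and ground-truth sequences gives the two pairs of numbers that would be compared.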
Sharpness metric: Cumulative probability blur detection (CPBD) is used to measure the sharpness of the frames generated by different methods. According to Tab. 2, our implementation of the renderer module generates the sharpest facial texture. However, as shown in our supplementary video, the generated texture of the mouth region when the mouth opens wide and the generated texture of the eyelids when the eyes close look slightly blurry. The blurry texture likely comes from the scarcity of data with open mouths and closed eyes. It should be possible to improve the texture by training the renderer with more data, or by replacing the renderer with a few-shot face generator.
Tab. 2 shows the effectiveness of the audio source separation step for the singing face generation task: almost all the evaluation metrics improve after decoupling human voice and background music. It also shows the superiority of our method, which generates the most realistic singing face videos and performs better on all the evaluation metrics.

User Study
We invite 15 volunteers to evaluate our method and previous works. The volunteers are a gender-balanced group of college students with no previous experience of face synthesis studies; they are informed of the study's purpose, the evaluation standard, and the number of compared video groups before making evaluations. The volunteers are asked to evaluate the videos group by group. In each group, 5 synthesized videos of the compared methods are shown in order. There are 5 video groups in total. When evaluating each group, the volunteers are asked to watch all the videos of the group, and then to score the videos at once based on the following criteria: 1) audio-visual synchronization, 2) natural

Discussion
Limitation and Future Work: The proposed method achieves more expressive results than previous methods. However, as shown in the supplementary video, in chaotic environments our method fails like previous methods, because the adopted audio separator cannot distinguish different human voices. In addition, this paper focuses on the synthesis of the head region, leaving the dynamics of the upper torso unsolved. We note that it is more challenging to generate a realistic and expressive virtual human with dynamics of the upper torso and even the full body. This will be the direction of our future efforts. Moreover, our method only learns the implicit context from the input audio, and incorporating semantics from the lyrics of songs is an interesting direction for improvement.
Ethics Statement: The work itself does not uniquely raise any new ethical challenges. However, we must acknowledge that the topic of image/video synthesis has raised many ethical concerns. These kinds of algorithms are vulnerable to malicious use, such as being misused to produce misleading information or to violate portrait rights. Therefore, we appeal to the research community and potential users to explore these techniques responsibly.

Appendix
Here we elaborate on more technical details of our proposed pipeline, including our Encoder, Decoder, and Discriminator. Note that we sample batches of data with a frame window length of T = 128 frames and a batch size of 64 during training.

The Architecture of Encoder
The Architecture of TSAE: Our Two-Stream Audio Encoder (TSAE) is composed of two Single-Stream Audio Encoders (AE) with the same structure but without shared parameters. The Single-Stream Audio Encoder is a 1D convolutional neural network with residual blocks, of the kind typically used for time series classification [69]. The detailed architecture of our Single-Stream AE is shown in Tab. 4.
The Architecture of ATM: Our Attention-based Modulator (ATM) is a simple U-net-based 1D convolutional neural network (CNN). It takes the encoded audio features $l \oplus m \in \mathbb{R}^{T \times 256}$ as input, then computes attention values and applies them to the audio features, similar to the channel-attention mechanism proposed in [57] for CNNs. The U-net structure of our ATM is summarized in Fig. 10, and the overall architecture of our ATM is summarized in Tab. 5. Note that we adopt one ATM for each generation task, with the same structure but without shared parameters.

The Architecture of Decoder
All the MLP networks in our Decoder consist of two fully connected (FC) layers with ReLU as the activation function. The first FC layer in the MLP contains 128 nodes, while the number of nodes in the second FC layer is determined by the task (64 for expression generation, 6 for head pose generation, and 1 for eye state generation). The number of input channels of the LSTM in our Pose Generation Network (PGN) is 268 (128

The Architecture of Discriminator
Our Discriminator is a simple CNN implemented with Conv1D, BatchNorm1D, and LeakyReLU layers. Taking the concatenated audio MFCC features and generated parameters (59 channels in total, including 26 for the human voice, 26 for the background music, 6 for the head pose sequence, and 1 for the eye state sequence) of a time window T = 128, the Discriminator predicts whether the input head pose and eye state sequences are real or generated. Note that we train our Discriminator using LSGAN [70] for training stability. The structure of our Discriminator is summarized in Tab. 6.
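As a rough illustration of the least-squares objective, the sketch below computes LSGAN-style losses from lists of discriminator scores. It assumes the common label coding (target 1 for real sequences, 0 for generated ones) and a factor of 0.5; the text does not state which coding or weighting is used, and the function names are hypothetical. The channel count is the sum given in the paragraph above.

```python
from typing import Sequence

# Channel layout from the text: voice MFCC + music MFCC + head pose + eye state.
N_CHANNELS = 26 + 26 + 6 + 1  # 59


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def lsgan_d_loss(real_scores: Sequence[float], fake_scores: Sequence[float]) -> float:
    """Discriminator loss: push real scores toward 1 and generated scores toward 0."""
    real_term = _mean([(s - 1.0) ** 2 for s in real_scores])
    fake_term = _mean([s ** 2 for s in fake_scores])
    return 0.5 * real_term + 0.5 * fake_term


def lsgan_g_loss(fake_scores: Sequence[float]) -> float:
    """Generator loss: push the scores of generated sequences toward 1."""
    return 0.5 * _mean([(s - 1.0) ** 2 for s in fake_scores])
```

Unlike a sigmoid cross-entropy loss, the squared penalty still produces gradients for samples that are already on the correct side of the decision boundary, which is the usual motivation for choosing this objective for stability.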

Fig. 2
Fig. 2 Framework overview. Taking human voice and background music separated from music audio as input, the Generator module generates facial driving parameters (expressions, head poses and eye states). Conditioned on fixed parameters (identity, texture, lighting) extracted from a reference face image and on the driving parameters, the Renderer module aims to synthesize a photo-realistic video. Specifically, eye state parameters are encoded into eye attention maps, and the other parameters provide 3D model guidance to render faces. Finally, an expressive and rhythmic singing face video is rendered by combining the rendered faces with the eye attention maps.

Fig. 3
Fig. 3 The Architecture of Our Generator. Our generator contains an Encoder and a Decoder. The Encoder consists of a Two-stream Audio Encoder (TSAE) and an Attention-based Modulator (ATM). The Decoder contains three downstream generators: the Expression Generation Network (EGN), the Pose Generation Network (PGN), and the Eye State Generation Network (ESGN).
$f^v_t$ and $f^b_t$ indicate the encoded audio features. The subscript $t$ indicates the time step, and the superscripts $v$ and $b$ indicate human voice and background music, respectively. The encoded audio features of the full audio sequence, $f^v$ and $f^b$, are obtained by stacking the audio features of each time step.

Fig. 4
Fig. 4 The Euler angle (Ry) dynamics of a person singing the same song twice. It shows that although the head may rotate in opposite directions, the speed remains similar. This observation also holds for the other head pose parameters.
$B_i \sim U(a_i, b_i)$ and blink duration $B_d \sim U(a_d, b_d)$ with the empirical parameters $a_i = 1.2$ s, $b_i = 2.0$ s, $a_d = 0.10$ s, $b_d = 0.45$ s.
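The uniform distributions above can be sampled as in the following minimal sketch. It assumes that B_i is the interval between consecutive blinks, measured from the end of one blink to the start of the next; the surviving text does not spell this out. The function name `sample_blinks` is hypothetical.

```python
import random
from typing import List, Tuple

A_I, B_I = 1.2, 2.0    # bounds of U(a_i, b_i) in seconds, from the text
A_D, B_D = 0.10, 0.45  # bounds of U(a_d, b_d) in seconds, from the text


def sample_blinks(total_seconds: float, rng: random.Random) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of random blinks within a clip."""
    blinks = []
    t = 0.0
    while True:
        # Assumed reading: wait B_i seconds after the previous blink ends.
        t += rng.uniform(A_I, B_I)
        duration = rng.uniform(A_D, B_D)
        if t + duration > total_seconds:
            break
        blinks.append((t, t + duration))
        t += duration
    return blinks


if __name__ == "__main__":
    for start, end in sample_blinks(10.0, random.Random(0)):
        print(f"blink from {start:.2f}s to {end:.2f}s")
```

Passing a seeded `random.Random` instance keeps the sampled blink schedule reproducible across runs.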

Fig. 5
Fig. 5 Qualitative Result of Ablation Study. It shows a) the generated frames and b) the pitch (red), yaw (blue), and roll (green) of head pose dynamics. In a), the mouth generated by Single-stream stays open during silence (red box), while the others stay closed (green box). In b), our generated head pose dynamics are smoother than the others. Moreover, the turning point of the dominant varying angle (shown as the green curve) occurs at nearly the same time as in the ground truth, meaning that our generated head dynamics have a rhythm more similar to the ground truth recorded by a performer.

Fig. 6
Fig. 6 Attention Weight Visualization. Brighter white represents higher weights. The horizontal direction is along time, and the vertical direction is along the feature dimension.

Fig. 7
Fig. 7 Visual Comparison with State-of-the-art Methods. a) and c) are the generated video frames. b) and d) are the corresponding tracemaps of facial landmarks across multiple frames. From the tracemaps, we can see that our method generates the most diverse head pose dynamics.

Fig. 9
Fig. 9 Distribution of Eye Closing Duration. Our method is able to generate closed-eye durations whose distribution is similar to that of the real videos.

Fig. 10
Fig. 10 The detailed U-net structure used in our Attention-based Modulator.
for $l^{pose}_t$, 128 for $m^{pose}_t$, 6 for $p_{t-1}$ and 6 for $v_{t-1}$), and the output channel is 128.
$w_1, w_2, w_3, w_4, w_5, w_6$ are balancing weights. $f$, $p$, $v$, $e^{long}$ are vectors containing the time-series ground truth parameters of facial expression, head pose, head moving velocity and long-time eye closing state (note that we only learn long-time eye closing dynamics from data), for $t = 1, 2, ..., T$. $\hat{f}$, $\hat{p}$, $\hat{v}$, $\hat{e}^{long}$ are the corresponding predicted vectors. $att^{exp}$, $att^{pose}$, $att^{eye}$ are the predicted attention matrices for the tasks of facial expression generation, head pose generation, and eye state generation, respectively. ABS(x) denotes taking the absolute value of each element. We only supervise the absolute speed of the generated head pose dynamics here, guiding the network to generate more rhythmic head pose dynamics aligned with the music.
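The effect of supervising only the absolute speed can be illustrated with a minimal sketch. It assumes that velocity is the first difference of consecutive pose frames and that the speed term is a mean absolute error between ABS of the predicted velocity and ABS of the ground-truth velocity; the exact distance and weighting used in the loss are not recoverable from this fragment. The function names are hypothetical.

```python
from typing import List, Sequence

Pose = Sequence[float]  # head pose parameters of one frame (6 in the paper)


def velocity(poses: Sequence[Pose]) -> List[List[float]]:
    """First difference between consecutive frames (assumed definition of v_t)."""
    return [[cur - prev for prev, cur in zip(p_prev, p_cur)]
            for p_prev, p_cur in zip(poses, poses[1:])]


def speed_loss(pred: Sequence[Pose], gt: Sequence[Pose]) -> float:
    """Mean absolute difference between ABS(predicted velocity) and ABS(true velocity)."""
    v_hat, v = velocity(pred), velocity(gt)
    diffs = [abs(abs(x_hat) - abs(x))
             for frame_hat, frame in zip(v_hat, v)
             for x_hat, x in zip(frame_hat, frame)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    gt = [[0.0], [1.0], [3.0]]           # head turns one way
    mirrored = [[0.0], [-1.0], [-3.0]]   # same speed, opposite direction
    print(speed_loss(mirrored, gt))      # no penalty for the mirrored motion
```

Under these assumptions, a prediction that mirrors the ground-truth rotation incurs no speed penalty, matching the observation in Fig. 4 that direction may differ between performances while speed stays similar.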

Table 4
The Architecture of Audio Encoder.

Table 5 Detailed Structure of ATM.
Note that we treat the last channel of the input as the feature channel, so the convolution and deconvolution operations are performed over the last dimension of the input, and the Multiply in the table means $att\,(f^v \oplus f^b)$.
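The Multiply step can be sketched as follows. This assumes that the concatenation joins the two streams along the feature dimension for each time step, and that the attention matrix has the same shape as the concatenated features and is applied element-wise; both readings are inferred from the description of the ATM and are not confirmed by the table itself. The function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # T rows (time steps), one column per feature channel


def concat_streams(f_v: Matrix, f_b: Matrix) -> Matrix:
    """Concatenate per-frame voice and music features along the feature dimension."""
    if len(f_v) != len(f_b):
        raise ValueError("both streams must cover the same T time steps")
    return [row_v + row_b for row_v, row_b in zip(f_v, f_b)]


def modulate(att: Matrix, features: Matrix) -> Matrix:
    """Element-wise product of attention weights and features (assumed reading of Multiply)."""
    return [[a * x for a, x in zip(att_row, feat_row)]
            for att_row, feat_row in zip(att, features)]
```

With two 128-channel streams, `concat_streams` yields the T × 256 feature matrix, and each generation task would apply its own attention matrix to it.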