1 Introduction

Audiovisual multimodal generation is an important research topic in computer vision with a wide range of applications, such as recovering missing modalities, generating zero-shot samples, and supporting artistic creation. However, audiovisual generation still faces substantial challenges due to the large discrepancy between the visual and audio modalities.

To address the problem of audio-visual multimodal generation, Chen et al. developed a conditional generative adversarial network (GAN) [1], which has several shortcomings. First, the mutual generation of visual-to-audio and audio-to-visual is realized through two independent networks. Second, each generation path is realized in two stages: the first stage extracts the discriminative information of the known modality, and the second stage uses this information to generate the corresponding unknown modality. These two phases limit the efficiency of generation.

Traditional audio-visual multimodal generation has achieved promising results, but most methods operate on a single sample (that is, image to sound spectrogram or sound spectrogram to image). However, the visual and audio modalities in a video are both sequential. To further explore the essence of audiovisual multimodal generation and achieve sequence-level generation, we exploit the temporal information in videos for audiovisual multimodal sequence generation. We focus on the problem of generating a dance pose sequence from sound, referred to as dancing with the sound. Meanwhile, a novel dataset for visual-audio multimodal sequence generation is constructed.

We formalize the task of dancing with the sound as a conditional generation problem (as shown in Fig. 1) and explore it with various generative models. The key to our task is to exploit the natural temporal coherence and rhythm of sound signals to generate diverse, plausible, and smooth dance pose sequences.

Fig. 1 Process of dance pose sequence generation

The main contributions of this paper can be summarized as follows:

  1. We propose the problem of generating dance pose sequences from sound information.

  2. We establish a novel visual-audio multimodal sequence generation dataset.

  3. We verify the difference between noise and sound information in the process of dance pose sequence generation.

  4. We verify the role of sound discrimination and rhythm in the generation of dance pose sequences.

The remainder of this paper is organized as follows: Section 2 briefly reviews related works. In Sect. 3, the dance with the sound model is proposed. Experimental results are presented in Sect. 4 and discussions are elaborated in Sect. 5. Section 6 concludes our work.

2 Related works

In this section, we discuss research works closely related to our model, covering sound to body dynamics, motion transformation, and pose to image/video generation.

2.1 Sound to body dynamics

The relationship between voice and facial motion has been extensively studied in some classic works [2, 3]. Recently, the generation of high-quality facial video only from voice information [4] and animation simulation [5, 6] have achieved remarkable results.

Although there is no mature work that generates body posture from music alone, there is a considerable body of related work: multimodal research combining audio and video input, behavioral research on the body's response to music and sound, and research on predicting changes in body posture (such as learning walking and dancing styles from videos). These directions are summarized below.

2.1.1 Multimodal research combining audio and video input

Multimodal research [7] indicates that combining audio and visual input generates better results than a single modality alone [8]. For example, facial expression recognition performs better with additional voice input, because the extra information aids identification [9, 10]. Studies by Wang et al. [11] show that body posture can be estimated with high accuracy when voice input is available [12]. Multi-speaker tracking [13] combines multimodal information to identify the intention of interaction [14].

2.1.2 Behavioral research on the body’s response to music and sound

In dynamic interaction, researchers have studied the relationship between speech and body rhythm. For example, [15, 16] report that specific audio characteristics are related to the frequency of body movement [17]. Different emotional situations lead to different types of movement and interaction. For example, regular interactions [17] and interviews [18] have been analyzed to determine whether there is a correspondence between the ways pianists play specific pieces, and it was found that the way we perceive music-motion correspondence is consistent. Meanwhile, this correspondence is shown to be flexible, that is, the same music segment may correspond to different motion variants.

2.1.3 Research on predicting changes in body posture

Many recent studies use long short-term memory (LSTM) networks to predict and edit future poses from a short motion video or even a single photo [19,20,21,22,23,24,25]. CNNs are also commonly employed for human pose estimation and recognition [26,27,28,29,30]. These works show that it is feasible to learn movement styles from videos [31] and to transfer them to animated characters in real time [32]. However, audio input is not available in these works.

Another complementary research area is predicting audio from video, which is the opposite of predicting video from audio. Examples include estimating sound from face video for lip reading [33], and utilizing video [34, 35] or photos [36] as input to predict sound objects. Finally, there is work exploring video-driven audio editing [37].

Although the above related work does not directly study generating body posture from audio or music, the authors of [38] attempted to verify this possibility. They proposed a model that learns sound-to-body dynamics: it takes violin or piano audio as input, outputs the corresponding video pose prediction, and further uses the predicted poses to drive animation. The core idea is to create an animated character purely from audio; the character moves like a violinist or pianist when music is played. They do not aim to reproduce precise body movements. In fact, the mapping from speech/music to body movement is assumed to be non-unique, and the goal is to create natural body postures that are meaningful for the music or speech.

Although Ref. [38] explored the possibility of audio-to-pose generation and verified the idea experimentally, their research focuses on violin and piano playing gesture sequences, which involve simple gestures and a small movement space. Going beyond this limitation, we further explore the generation of dance pose sequences from sound.

2.2 Motion transformation

Long-term pedestrian movement can be represented as a transition between a series of motion modes, where a motion mode is a short-term dynamic motion sequence. Yan et al. [39] proposed a motion transformation variational autoencoder (MT-VAE) for learning motion sequence generation. Their model jointly learns the feature encoding of one motion mode and the feature transformation that represents the transition from one motion mode to the next, and it can generate diverse and credible motion sequences from the same input. Inspired by this model, our work learns motion transformations for pose sequences.

2.3 Pose to image/video generation

Chan et al. [40] proposed a "follow with me" motion transfer task, that is, transferring the movements in a given source dancing video to another person. They treated this task as frame-by-frame image-to-image translation with spatiotemporal smoothing constraints. Specifically, by using the pose representation as an intermediate representation of the source and target, they learned the mapping from the pose image to the appearance of the target subject.

Ma et al. [41] proposed a novel pose-guided human image generation network (PG2), which can synthesize images of arbitrary pose based on an image of a person and a new pose. PG2 employs specific pose information and consists of two key stages: pose integration and image refinement. In the first stage, the conditional image and target pose are fed into a UNet-like network to generate an initial but rough image with the target pose. In the second stage, a UNet-like generator is trained adversarially to refine the initial blurred image.

Yang et al.  [42] extended the pose-oriented image generation network [41], and further proposed a pose-oriented video synthesis model in a decoupled manner. In the first stage, it employs the Pose Sequence Generative Adversarial Network (PSGAN) to generate a pose sequence based on class label through adversarial methods. In the second stage, the Semantic Consistent Generating Adversarial Network (SCGAN) is utilized to generate video frames from the pose while maintaining the coherent appearance of the input image.

3 Methods

In this section, we will introduce our dance with the sound method in detail.

Due to the appealing motion transition learning capability of the motion transformation VAE (MT-VAE) [39], our model seeks inspiration from it. However, there are differences between the two. MT-VAE focuses on pose prediction within the same modality at different moments: the pose at time \(t+s\) is predicted from the pose at time t. In contrast, our model focuses on pose prediction across different modalities at different moments: the pose at time \(t+s\) is predicted from the pose at time t together with the sound at time \(t+s\).

Thus, we make the following improvement to MT-VAE [39] for our work. The input of the model is changed from pose samples of the same modality at different moments to pose and sound samples of different modalities at different moments. Consequently, our model learns not the motion transition between pose sequences at different moments, but the transition underlying the pose sequence at the present moment and the sound sample at the next moment.
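
To make this cross-modal pairing concrete, the following is a minimal sketch of how the training triples could be assembled, assuming pose keypoints are stored as a (T, 36) array aligned frame-by-frame with a (T, D) array of audio features; the names and shapes are illustrative and not the exact preprocessing used in the paper.

```python
import numpy as np

def build_pairs(pose, mfcc, s):
    """Assemble cross-modal training triples.

    pose: (T, 36) array of 18 (x, y) joint coordinates per frame.
    mfcc: (T, D) array of audio features aligned to the video frames.
    s:    step size of the LSTM encoder/decoder.

    Returns arrays of (pose base at t:t+s, sound at t+s:t+2s,
    target pose at t+s:t+2s).
    """
    bases, sounds, targets = [], [], []
    for t in range(0, pose.shape[0] - 2 * s, s):
        bases.append(pose[t:t + s])            # input pose base (Feature A source)
        sounds.append(mfcc[t + s:t + 2 * s])   # input sound (Feature B source)
        targets.append(pose[t + s:t + 2 * s])  # ground-truth future pose
    return np.stack(bases), np.stack(sounds), np.stack(targets)
```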

During the training phase, our model generates the next-moment pose based on two inputs: the current pose base and the sound of the next moment. During the test phase, our model can generate different pose sequences under the guidance of sound at different times. As shown in Fig. 2, our model mainly contains four components: an LSTM encoder, a latent encoder, a latent decoder, and an LSTM decoder.

Fig. 2 Overview of the proposed model

In Fig. 2, the LSTM encoder encodes the pose base at time \((t:t+s)\) and the music at \((t+s:t+2s)\) into corresponding features, named Feature A and Feature B respectively. Feature A contains the horizontal and vertical coordinates of 18 joint points (dimension 36). Feature B is the mel-frequency cepstral coefficient (MFCC) feature of the sound. Here, s denotes the step size of the LSTM encoder. The latent encoder encodes the difference between Feature A and Feature B into the latent space z. The latent decoder generates the future pose feature, Feature \(B^*\), from the latent variable z and the current pose feature, Feature A. The LSTM decoder generates the pose sequence from Feature \(B^*\).
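
As a rough illustration of these four components, the PyTorch sketch below wires an LSTM encoder for each modality, a latent encoder over the feature difference, a latent decoder conditioned on Feature A, and an LSTM decoder back to joint coordinates. Apart from the 512 hidden units and 36-dimensional pose stated in this paper, the layer sizes, the MFCC dimensionality, the reparameterised latent sampling, and the way Feature \(B^*\) is fed to the decoder are assumptions borrowed from the MT-VAE recipe rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DanceWithSound(nn.Module):
    """Sketch of the four-component model; exact layer sizes are assumptions."""

    def __init__(self, pose_dim=36, mfcc_dim=13, hidden=512, latent=64):
        super().__init__()
        self.pose_lstm = nn.LSTM(pose_dim, hidden, batch_first=True)   # LSTM encoder for the pose base
        self.sound_lstm = nn.LSTM(mfcc_dim, hidden, batch_first=True)  # LSTM encoder for the sound
        self.to_mu = nn.Linear(hidden, latent)                         # latent encoder: feature difference -> z
        self.to_logvar = nn.Linear(hidden, latent)
        self.latent_dec = nn.Linear(latent + hidden, hidden)           # latent decoder: (z, Feature A) -> Feature B*
        self.pose_dec = nn.LSTM(hidden, hidden, batch_first=True)      # LSTM decoder
        self.out = nn.Linear(hidden, pose_dim)                         # back to 18 (x, y) joint coordinates

    def forward(self, pose_base, sound):
        # pose_base: (B, s, 36); sound: (B, s, mfcc_dim)
        _, (feat_a, _) = self.pose_lstm(pose_base)        # Feature A: (1, B, hidden)
        _, (feat_b, _) = self.sound_lstm(sound)           # Feature B
        diff = (feat_b - feat_a).squeeze(0)               # difference between the two features
        mu, logvar = self.to_mu(diff), self.to_logvar(diff)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)        # sample latent z
        feat_b_star = self.latent_dec(torch.cat([z, feat_a.squeeze(0)], dim=-1))
        dec_in = feat_b_star.unsqueeze(1).repeat(1, sound.size(1), 1)  # one decoder input per output step
        h, _ = self.pose_dec(dec_in)
        return self.out(h), mu, logvar                    # generated pose sequence and VAE statistics
```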

The loss function of our model during training consists of two terms: a consistency loss \({\mathcal {L}}_{Con}\) and a motion smoothness loss \({\mathcal {L}}_{Smooth}\).

$$\begin{aligned} {\mathcal {L}}_{Con}= {\mathcal {L}}_{l2}(P^*,P) \end{aligned}$$
(1)

where P is the original pose sequence coordinates and \(P^*\) is the generated pose sequence coordinates. This loss forces the generated coordinates to stay as close as possible to the original pose sequence so that a reasonable pose sequence is generated.

To keep the generated pose sequence as smooth as possible, the \({\mathcal {L}}_{Smooth}\) term constrains the second-order differences of the generated pose sequence coordinates to be as close as possible to those of the original pose sequence.

$$\begin{aligned} {\mathcal {L}}_{Smooth}= \frac{1}{K-1}\sum ^{K-1}_{t=1}\Vert m^*_t-m_t\Vert \end{aligned}$$
(2)

where K is the total number of time steps and \(m_t\) is the second-order difference term, \(m_t=P_{t+1}+P_{t-1}-2P_{t}\); \(m^*_t\) is defined analogously as \(m^*_t=P^*_{t+1}+P^*_{t-1}-2P^*_{t}\).
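
A straightforward realisation of the two loss terms, assuming an L2 (mean squared error) form for \({\mathcal {L}}_{Con}\) and averaging of the second-order-difference norms over time and batch (the exact reduction is not specified in the paper), might look like the following.

```python
import torch.nn.functional as F

def consistency_loss(p_gen, p_gt):
    """L_Con (Eq. 1): L2 distance between generated and original coordinates."""
    return F.mse_loss(p_gen, p_gt)

def smoothness_loss(p_gen, p_gt):
    """L_Smooth (Eq. 2): match second-order differences of the two sequences.

    p_gen, p_gt: (B, K, 36) generated and original pose coordinate sequences.
    """
    def second_diff(p):
        # m_t = P_{t+1} + P_{t-1} - 2 * P_t for the interior time steps
        return p[:, 2:] + p[:, :-2] - 2 * p[:, 1:-1]
    return (second_diff(p_gen) - second_diff(p_gt)).norm(dim=-1).mean()
```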

4 Experimental results and analysis

4.1 Setup

To perform our newly proposed dancing with the sound task, a novel dataset of 116 ballet dance recitals is established. For convenience and fair comparison, we synchronize all videos to 24 fps (frames per second). 80% of the dataset is randomly chosen as the training set and the rest is used as the test set. Samples from the dataset are shown in Fig. 3.
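
This preprocessing can be reproduced roughly as below; the directory names are placeholders, the ffmpeg call simply re-encodes each recital at 24 fps, and the random 80/20 split is done at the video level.

```python
import random
import subprocess
from pathlib import Path

videos = sorted(Path("ballet_videos").glob("*.mp4"))   # the 116 recital videos (path is a placeholder)
out_dir = Path("videos_24fps")
out_dir.mkdir(exist_ok=True)

# Synchronize every video to 24 frames per second.
for v in videos:
    subprocess.run(["ffmpeg", "-y", "-i", str(v), "-r", "24", str(out_dir / v.name)], check=True)

# Random 80/20 train/test split.
random.seed(0)
random.shuffle(videos)
cut = int(0.8 * len(videos))
train_videos, test_videos = videos[:cut], videos[cut:]
```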

Fig. 3 Examples of dataset

Meanwhile, we employ the OpenPose [43,44,45,46] toolkit to extract the poses from all video frames, as shown in Fig. 4. In Fig. 4, the first row shows the original images, the second row shows the original images annotated with 18 joint points, and the third row presents the 18 joint points extracted for each original image.
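
OpenPose can write one JSON file per frame; a minimal loader that turns these files into the 36-dimensional pose vectors used by our model could look as follows. This sketch assumes the COCO 18-keypoint model, a single dancer per frame, and the standard `pose_keypoints_2d` field of recent OpenPose versions; it is an illustration rather than the exact pipeline used here.

```python
import json
import numpy as np
from pathlib import Path

def load_openpose_keypoints(json_dir):
    """Read per-frame OpenPose JSON files into a (T, 36) array of (x, y) coordinates."""
    frames = []
    for f in sorted(Path(json_dir).glob("*_keypoints.json")):
        people = json.loads(f.read_text())["people"]
        if not people:                                # no detection in this frame
            frames.append(np.zeros(36))
            continue
        kp = np.array(people[0]["pose_keypoints_2d"]).reshape(-1, 3)  # 18 x (x, y, confidence)
        frames.append(kp[:, :2].reshape(-1))          # drop confidence -> 36-dim vector
    return np.stack(frames)
```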

Fig. 4 Examples of pose joint points extracted from the dataset

Adam is utilized to train the model. The number of LSTM hidden units is 512, the step size is 400, and the batch size is 16. The learning rate is 0.0001, and the model is trained for 200 epochs.
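
Putting the pieces together, a training loop with these hyper-parameters might look like the sketch below, reusing the illustrative DanceWithSound, consistency_loss, and smoothness_loss definitions from Sect. 3; `loader` is an assumed DataLoader yielding batches of 16 (pose base, sound, target pose) triples of step size 400, and the two loss terms are weighted equally because the paper does not state a weighting.

```python
import torch

model = DanceWithSound()                                   # sketch from Sect. 3
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # Adam, learning rate 0.0001

for epoch in range(200):                                   # 200 epochs
    for pose_base, sound, target in loader:                # batch size 16, step size 400
        pred, mu, logvar = model(pose_base, sound)
        loss = consistency_loss(pred, target) + smoothness_loss(pred, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```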

4.2 Evaluation of the discrimination of the audio information

Discrimination is essential in prediction and generation tasks in computer vision. If the input music/sound segment is not discriminative, a reasonable dance pose sequence cannot be generated. Consequently, to tackle the task of dancing with music, we verify the discriminative capability of the music from three aspects: the difference between music-guided and noise-guided pose generation, the same pose sequence with different audio sequences, and the same audio sequence with different pose sequences.

4.2.1 Evaluation of the difference between music-guided and noise-guided pose generations

Most traditional pose generation tasks generate target samples (such as images) from noise. However, these tasks differ from ours: most of them focus on the generation of a single sample, whereas our task concerns sequence-level generation, and the condition of our task is sound rather than noise. To verify the effectiveness of our model, we first need to figure out whether noise can generate a pose sequence at all. If so, we further need to explore the difference between the dance pose sequences generated from noise and from sound. The comparison of the pose sequences generated from noise and sound is shown in Fig. 5: (a) shows the pose sequence generated from noise, and (b) presents the pose sequence generated from sound. Figure 5 demonstrates that there is almost no diversity in the dance poses generated from noise, indicating that the noise-conditioned model has collapsed. On the contrary, the dance pose sequence generated from sound shows promising diversity, which indicates that music samples are more discriminative than noise. In order to let readers see the changes in the generated dance poses more clearly, we present them with a step size of 10. All subsequent experimental results are illustrated in this way by default.

Fig. 5 Generated dance pose sequence with noise or music

4.2.2 Evaluation of the same pose sequence with different audio sequences

To verify the discriminativeness of the sound information, we compare the dance pose sequences generated from the same pose base and sound samples taken at different moments. The comparison results are shown in Fig. 6. Specifically, the input pose base for all rows of Fig. 6 is sampled from the time range t = 0:200, while the input sound samples for (a), (b), (c), and (d) are sampled from the time ranges 0:200, 200:400, 400:600, and 600:800 respectively. From Fig. 6, it is observed that the pose sequences generated from the same pose base with sound at different moments are different, validating the discriminativeness of the sound information.

Fig. 6 Generated dance pose sequences based on the same pose base and different sound. a Sound at 0:200. b Sound at 200:400. c Sound at 400:600. d Sound at 600:800

4.2.3 Evaluation of the same audio sequence with different pose sequences

To validate the discriminativeness of the sound information from another perspective, we compare the dance pose sequences generated from the same sound sample and pose bases taken at different moments. The comparison results are shown in Fig. 7. Specifically, the input sound sample for all rows of Fig. 7 is sampled from the time range t = 0:200, while the input pose bases for (a), (b), (c), and (d) are sampled from the time ranges 0:200, 200:400, 400:600, and 600:800 respectively. Figure 7 illustrates that the pose sequences generated from the same sound sample with different pose bases are almost identical, which verifies the discriminativeness of the sound signal from a complementary perspective.

Fig. 7 Generated dance pose sequences based on different pose bases and the same sound. a Pose base at 0:200. b Pose base at 200:400. c Pose base at 400:600. d Pose base at 600:800

4.3 Evaluation of the influence of the sound rhythm on the generated dance rhythm

In order to verify the influence of the sound rhythm on the rhythm of the generated dance poses, we compare different models, including \(M_{slow}\), \(M_{normal}\), and \(M_{fast}\). Here, \(M_{slow}\) uses audio with a rhythm 2 times slower, implemented by upsampling the original audio sample by a factor of 2. \(M_{normal}\) uses the audio with its original rhythm. \(M_{fast}\) uses audio with a rhythm 2 times faster, obtained by downsampling the original audio sample by a factor of 2.
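
The tempo manipulation described above can be emulated by plain waveform resampling; the sketch below uses SciPy's FFT-based resampler as one possible realisation (note that, as with any naive resampling, the pitch shifts along with the rhythm). The function and variable names are illustrative.

```python
import numpy as np
from scipy.signal import resample

def change_rhythm(audio, factor):
    """Return a waveform whose rhythm is `factor` times faster when played
    back at the original sample rate.

    factor = 0.5 -> 2x upsampling   -> rhythm twice as slow (input to M_slow)
    factor = 2.0 -> 2x downsampling -> rhythm twice as fast (input to M_fast)
    """
    return resample(audio, int(len(audio) / factor))

audio = np.random.randn(16000 * 10)     # placeholder 10 s waveform at 16 kHz
slow_audio = change_rhythm(audio, 0.5)  # fed to M_slow
fast_audio = change_rhythm(audio, 2.0)  # fed to M_fast
```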

Comparison results are reported in Fig. 8: (a), (b), and (c) show the pose sequences generated by the \(M_{slow}\), \(M_{normal}\), and \(M_{fast}\) models respectively. Figure 8 demonstrates that different sound rhythms generate pose sequences with different rhythms, which further verifies the controllability offered by the audio information. Note that the speed of the rhythm is reflected in the number of distinct pose motion modes displayed: the \(M_{slow}\) model generates fewer pose modes, whereas the \(M_{fast}\) model generates more, as expected.

Fig. 8 Generated dance pose sequence based on music with different rhythms

4.4 Evaluation of the effectiveness of iterative updating

Based on the pose base, our model generates a reasonable dance pose sequence under the guidance of the sound sample. The step sizes of the input pose base and sound are both 400, and the generated pose sequence also has a step size of 400. To reduce the length of the pose base without significant performance degradation, we use an iterative strategy to generate the pose sequence. Specifically, assuming the total step size is 400 and the iteration step size is t, the number of iterations is \(k=400/t\). That is, with a pose base of step size t and under the guidance of a sound sample of step size t, a pose sequence of length t is generated. Then, based on this generated pose sequence of length t and the sound samples collected from time \(t+1\) to 2t, the model generates the next pose sequence of length t. Iterating k times and concatenating the generated pose sequences yields a dance pose sequence with a total length of 400.
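
This iterative strategy can be written as a short loop that feeds each generated chunk back in as the next pose base; the sketch below reuses the illustrative DanceWithSound model from Sect. 3 and assumes the MFCC features for the full 400-step span are available up front.

```python
import torch

def iterative_generate(model, pose_base, mfcc, t):
    """Generate a long pose sequence in k = total_steps // t iterations.

    pose_base: (1, t, 36) seed pose of length t.
    mfcc:      (1, total_steps, D) sound features for the whole target span.
    """
    k = mfcc.size(1) // t
    chunks, current = [], pose_base
    for i in range(k):
        sound = mfcc[:, i * t:(i + 1) * t]       # sound guiding the next t steps
        with torch.no_grad():
            next_pose, _, _ = model(current, sound)
        chunks.append(next_pose)
        current = next_pose                      # feed the generated chunk back in
    return torch.cat(chunks, dim=1)              # (1, total_steps, 36)
```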

To verify the useful guidance information contained in the sound and the effectiveness of the iterative update, models with different iteration step sizes, namely 1, 4, 10, 50, 80, 200, and 400, are evaluated. Experimental results are shown in Fig. 9, where (a)–(g) present the pose sequences generated with step sizes 1, 4, 10, 50, 80, 200, and 400 respectively. Figure 9 illustrates that when the iteration step size is 1, the spatial coordinates of the generated dance pose sequence are severely offset because of the excessive accumulation of errors. When the iteration step size is 4 or 10, the spatial coordinates of the generated dance pose sequence are more reasonable. When the iteration step size is 50, 80, 200, or 400, the generated dance pose sequences are much better. Therefore, under the guidance of sound conditions, the iterative update strategy is very effective. When the iteration step size is 1/8 of the total step length or less, a dance pose sequence almost equivalent to that of the original model can be obtained, which greatly decreases the required training sample size.

Fig. 9 Generated dance pose sequences with different iteration step sizes

4.5 Evaluation of different audio feature representations

To validate the effectiveness of different audio feature representations, we design two groups of experiments: one assesses the smoothness and the other verifies the discriminative capability of the different audio features.

For the first group, we compare models with various audio features, including MFCC, log-MFCC, mel, and log-mel, to evaluate the smoothness of the different audio feature representations. Figure 10 shows the experimental results. We find that, on the one hand, mel-based features are smoother than their MFCC counterparts (mel > MFCC; log-mel > log-MFCC); on the other hand, features without the log transformation are smoother (mel > log-mel, MFCC > log-MFCC).

For the second group, we compare models with the same audio features, namely MFCC, log-MFCC, mel, and log-mel, to assess their discriminative capability. Experimental results are also displayed in Fig. 10. The mel feature is the least discriminative, since it makes iterative pose prediction infeasible when the iteration step is smaller than 200.

Overall, we find that the smoothness and the discriminativeness of audio features are in conflict, and we therefore choose the trade-off feature, MFCC, as our default audio feature representation.
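
For reference, the four feature variants compared in Fig. 10 could be computed with librosa roughly as follows; the number of mel bands/MFCC coefficients and the exact form of the log compression are assumptions, since the paper does not specify them.

```python
import numpy as np
import librosa

def audio_features(y, sr, kind="mfcc"):
    """Compute one of the four audio feature variants: mfcc, logmfcc, mel, logmel."""
    if kind in ("mel", "logmel"):
        feat = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)
    elif kind in ("mfcc", "logmfcc"):
        feat = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    else:
        raise ValueError(f"unknown feature kind: {kind}")
    if kind.startswith("log"):
        feat = np.log(np.abs(feat) + 1e-6)   # log compression (one possible choice)
    return feat.T                            # shape: (frames, feature_dim)
```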

Fig. 10 Generated dance pose sequences with different audio features: a log-mel, b log-MFCC, c mel, d MFCC

5 Discussions

Most traditional pose or motion prediction tasks are formulated as conditional generation problems. However, the conditions used are usually noise, a motion label, or a previous pose/motion sequence, which yields only short generated pose/motion sequences. To generate long-term pose sequences, we leverage sound as the condition, under the assumption that sound is sufficiently discriminative to enable the model to generate long-term pose sequences. Several experiments have verified this hypothesis: the pose sequences generated from the same sound sample with different pose bases are almost identical, the pose sequences generated from the same pose base with different sound sequences are different, and the dance pose sequences generated from sound show much more diversity than those generated from noise.

Moreover, to further exploit the discriminativeness of the sound information, we reduce the length of the pose base and use an iterative strategy to generate the pose sequence. Experimental results demonstrate that when the length of the pose base is 1/8 of the total step length or less, the generated dance pose sequence is almost equivalent to that of the original model, which greatly decreases the required training sample size. This result further illustrates the discriminativeness of the sound information from another aspect.

Additionally, the sound signal has several appealing properties, such as internal coherence and rhythm, which endow the generated pose sequence with appealing coherence and rhythmicity. The experiments in Sect. 4.3 verify the effectiveness of these sound properties in pose sequence generation.

6 Conclusions

This paper proposes a novel dancing with the sound task, which aims to explore the feasibility of predicting long-term pose sequences with the aid of sound information and the possibility of cross-modal generation. Further, a model based on the VAE framework has been proposed to tackle this task; it takes sound as the input condition and outputs the corresponding dancing pose sequence. Additionally, a new dancing with the sound dataset has been established. Extensive ablation studies have verified the effectiveness of our model and of the dancing with the sound task. Specifically, the discriminativeness of the sound information enables the model to generate long-term pose sequences, and the natural temporal coherence and rhythm of the sound endow the generated pose sequences with promising plausibility and rhythmicity.