A causal convolutional neural network for multi-subject motion modeling and generation

Inspired by the success of WaveNet in multi-subject speech synthesis, we propose a novel neural network based on causal convolutions for multi-subject motion modeling and generation. The network can capture the intrinsic characteristics of the motions of different subjects, such as the influence of skeleton scale variation on motion style. Moreover, after being fine-tuned on a small motion dataset for a novel skeleton that is not included in the training dataset, the network can synthesize high-quality motions with a personalized style for that skeleton. The experimental results demonstrate that our network models the intrinsic characteristics of motions well and can be applied to various motion modeling and synthesis tasks.


Introduction
Human-motion generation is useful for many applications, such as human-action recognition [45,5], motion prediction [14], and video synthesis [35]. Learning a powerful motion model from prerecorded human-motion data is challenging because highly nonlinear kinematic systems intrinsically govern human motion. As a solution, motion models should scale up effectively to multi-subject motion datasets, synthesize various motions, and be amenable to multiple tasks, such as motion denoising, motion completion, and controllable motion synthesis.
Recent deep-learning-based motion-synthesis algorithms have shown great potential for human-motion generation. Autoregressive models, such as restricted Boltzmann machines and recurrent neural networks (RNNs) [33,8,25], have been applied to motion synthesis by predicting the probability of future motion. Variational autoencoders (VAEs) [43,22,10] and generative adversarial networks (GANs) [37,2,19,38,23] have also been applied to motion modeling and synthesis. However, such models must employ careful training strategies to avoid error accumulation and mode collapse. The phase-functioned neural network (PFNN) and its successors [16,31,32] introduced a phase and a local phase to reduce the difficulty of motion modeling.
Inspired by the success of the causal-convolution-based WaveNet [34] in multi-subject speech synthesis, we propose a novel neural network (CCNet) based on causal convolutions to address the aforementioned issues in motion modeling and synthesis. We added one-dimensional (1D) convolution layers to enable CCNet to accept skeleton configurations as an input, which is necessary for the network to handle the scale and style variations among the skeletons of different subjects. The output of CCNet is the probability density function (PDF) of the motion at the next time step, conditioned on the motions at previous time steps, the control signals, and the skeleton configurations. Using a meticulously designed training strategy, CCNet can effectively capture the intrinsic characteristics of the motions of different subjects and generate more than 20,000 frames of motion.
The freezing issues frequently encountered in prior studies can be effectively mitigated with a Gaussian loss that simultaneously penalizes the deviation of joint angles, positions, and velocities during training. After being trained on motion-capture (mocap) data across multiple subjects, CCNet can generate high-quality motions for different subjects. Furthermore, CCNet can synthesize motions for novel skeletons that are not in the training dataset. If the network is fine-tuned with a small motion dataset of the novel skeleton, it can generate high-quality motions that are similar to the skeleton's ground-truth mocap data. Although the topology of the novel skeletons is the same as that of the skeletons in the training dataset, their scale variations significantly influence the quality and style of the generated motions. CCNet can accommodate these unobserved variations, as shown in our experiments.
We built a new large-scale motion dataset based on 12 subjects with various motion types and transitional clips between different types of motions to model the intrinsic characteristics of multi-subject motions. The dataset has 486,282 frames and will be made public together with our code.
To summarize, our main contributions are as follows. 1) We propose a novel neural network based on causal convolutions to model the intrinsic characteristics embodied in the motions of multiple subjects, which has rarely been explored in deep-learning-based motion-synthesis methods. 2) A new high-quality motion dataset is constructed across multiple subjects for motion synthesis. 3) CCNet is trained on our new dataset, efficiently generates high-quality motions (∼65 fps), and achieves state-of-the-art results in synthesizing characteristic motions for different subjects.

Related work
Data-driven motion-modeling methods have become mainstream in recent decades, owing to the development of motion-capture techniques and the increased computing power of GPUs. Please refer to [36,40] for a more comprehensive survey. In the following section, we review motion-synthesis methods that are related to our work.

Motion style
Our method excels at generating motions that carry the distinct intrinsic characteristics of different people, even when those people perform the same type of motion. For example, a heavyset man and a thin man typically have different walking styles for the same motion type. A critical challenge is to model such motion styles. The methods of [3,42] implicitly parameterize motion styles to synthesize diverse motions. [41] proposes a motion-style transfer method for a single person to address the problem of unlabeled heterogeneous motions. Wen et al. [39] applied normalizing flows to the task of unsupervised motion-style transfer and achieved impressive results that outperformed state-of-the-art methods. However, none of these methods have been extended to model the variation in human motion styles across different subjects. Aberman et al. [1] proposed a motion-retargeting method to address skeleton variations. However, the motions generated for a new subject were strictly limited to the source motion in terms of types and trajectories and had nearly the same style as the source motion. [26] proposed a method to address skeleton variations; however, it only models personalized style variations for particular human motions, such as walking or running. In contrast, our method can scale up to generate characteristic motions of different types for multiple subjects, and source motions do not need to be provided, as in motion retargeting.

Deep learning based motion synthesis
Deep learning is a remarkable tool for learning a compact, low-dimensional motion space from a dataset. Previous studies have explored many neural-network structures for motion modeling and have made significant progress in this area, such as mixtures of experts [32,22], RNNs [27,6,9], fully connected networks [17,24], graph networks [21,7], VAEs [44,30], and GANs [38,23]. These methods achieved good performance for short-term motion prediction or periodic locomotion-synthesis tasks, such as walking or running generation. However, long-term generation of various motion types often exhibits over-smoothed motions or freezing poses [32]. Introducing a phase [16] or local phase [31] as an additional motion feature helps avoid such problems; however, obtaining these extra temporally related features requires effort. Exact probability models [15,29,39] based on normalizing flows can also solve this problem and add diversity to the synthesized motions. However, these normalizing-flow-based motion-synthesis models are difficult to train and may cause noise and jerkiness.
Despite significant progress in deep-learning-based motion modeling and synthesis, constructing a model that is capable of accurately modeling the characteristics of motions across different subjects remains challenging. This is primarily owing to the lack of datasets containing features of multiple subjects. To address the challenges mentioned above, we propose CCNet, which models the intrinsic characteristics of the motions of different subjects by taking skeleton configurations as an input.
Our approach

Overview
The overall framework of the proposed system is illustrated in Fig. 1. The designed CCNet has three types of functional blocks: a motion-feature embedding encoder, a series of separate residual blocks (SRBs) (light-green blocks) used to capture the temporal correlations, and a decoder that maps the latent features to the probability distribution of the predicted motions. These blocks are discussed in Section 3.2.
We represent the $n$th frame in our training data as $x_n = \{x_n^e, x_n^\omega, x_n^p, x_n^v\}$, where $x_n^e$ denotes the vector of the relative joint rotations, represented using exponential coordinates [12], $x_n^\omega$ is the vector of the relative angular velocities of the joints, $x_n^p$ contains the 3D joint positions relative to the previous frame, and $x_n^v$ is the vector of the joint linear velocities. The foot-contact information in the $n$th frame is represented as a 2D binary vector $x_n^f$. The skeleton configuration combined with the direction, velocity, and motion type form our control signals $c_n = \{c_n^s, c_n^d, c_n^t\}$, where $c_n^d$ is a 12D vector formed by sparsely sampling the points on the motion trajectory, starting from the $n$th frame, in a one-second motion clip, and $c_n^t$ is a 10D one-hot vector that represents all ten types of motions in our dataset. Specifically, the skeleton configuration can be represented as $c_n^s = \{h_r, t_1^x, t_1^y, t_1^z, \ldots, t_m^x, t_m^y, t_m^z\}$, where $h_r$ is the height of the root joint, and the 3D positions of the $m$ non-root joints are relative to the root. Given the training data and corresponding control signals, our goal is to train CCNet $F_\theta$, parameterized by $\theta$, to model the PDF of the predicted motion for the $n$th frame, $p(x_n \mid X, c_n) = F_\theta(X, c_n)$, where $X = \{x_{n-l-1}, \ldots, x_{n-1}\}$ and $c_n$ are the motion data of the previous $l$ frames and the control signals of the $n$th frame, respectively. We use a Gaussian loss to encourage CCNet to output the ground-truth motion with high probability, a foot-contact loss to facilitate the removal of foot sliding in the generated motions, and a smoothness loss to reduce jerkiness (see Section 3.3). Noise is added to the sampled training motion data; thus, the network is robust to the accumulated error in motion synthesis and can produce high-quality, non-freezing motions. The slight foot sliding in the generated motions is removed using an inverse kinematics (IK) algorithm based on the predicted foot-contact labels.
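As a concrete illustration of the control-signal dimensionalities above, the following pure-Python sketch assembles the 10D one-hot motion-type vector and the $(1 + 3m)$-dimensional skeleton configuration; the helper names and example values are hypothetical, not from the paper's code:

```python
def one_hot(motion_type, num_types=10):
    """10D one-hot motion-type vector c^t (ten motion types in the dataset)."""
    v = [0.0] * num_types
    v[motion_type] = 1.0
    return v

def skeleton_config(root_height, joint_offsets):
    """Skeleton configuration c^s: the root height h_r followed by the 3D
    positions of the m non-root joints relative to the root, giving a
    (1 + 3m)-dimensional vector."""
    cfg = [root_height]
    for x, y, z in joint_offsets:
        cfg.extend((x, y, z))
    return cfg

c_t = one_hot(2)                                     # e.g., the third motion type
c_s = skeleton_config(0.95, [(0.1, 0.0, 0.0)] * 26)  # 26 non-root joints
```

With the paper's 27 non-finger joints, the skeleton configuration would be $1 + 3 \times 26 = 79$-dimensional.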

Encoder
Encoder $\psi_E$ has a simple "Conv1D-ReLU" structure, where the kernel size of the 1D convolution is 1. Conv1D layers with a kernel size of 1 ensure that the motion feature at each input frame is processed independently. Formally, the encoder $\psi_E$ takes the motion representation $X$ of the previous $l$ frames as the input and maps the observed sequence to a latent vector: $z = \psi_E(D(X))$. The dropout layer before the encoder, denoted by $D$, is used to resolve the possible overfitting problem, and its drop probability is set to 0.5.
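A kernel-size-1 Conv1D is simply a shared linear map applied to each frame, which is why the encoder leaves frames temporally independent. A minimal pure-Python sketch with hypothetical weights (a single layer followed by ReLU, not the paper's implementation):

```python
def conv1d_k1(frames, weights, bias):
    """Kernel-size-1 Conv1D + ReLU: the same linear map is applied to every
    frame independently, so no temporal mixing happens in the encoder."""
    out = []
    for frame in frames:  # frame: list of input channels
        channels = []
        for w_row, b in zip(weights, bias):
            v = sum(w * x for w, x in zip(w_row, frame)) + b
            channels.append(max(0.0, v))  # ReLU
        out.append(channels)
    return out

# Two input channels mapped to three latent channels, frame by frame.
frames = [[1.0, 2.0], [0.5, -1.0]]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]
z = conv1d_k1(frames, weights, bias)
```

Reordering the input frames only reorders the outputs, confirming the per-frame independence the text describes.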

Separate residual blocks
The core component of CCNet is the set of SRBs $\psi_R^i$, illustrated in Fig. 2. The SRBs are similar to the residual blocks used in WaveNet [34], which uses dilated causal convolution to guarantee the temporal ordering of the input motion data. The difference is that we add Conv1D layers to every SRB to extract the features of the control signals, which enhances the model's ability to capture the motion characteristics of different subjects, and we fuse these features with the motions through summation. Our network includes 20 SRBs, which are executed recursively. The $i$-th block takes the output of the $(i-1)$-th block and the control signals as its input, whereas the inputs of the first block $\psi_R^0$ are the outputs from the encoder and the control signals. The kernel size of these Conv1D layers is 1. Zeros are padded before the feature of $\psi_E(x_{n-l-1})$; therefore, the output of the causal convolution at a frame $n$ depends only on the frames before it. The padding size can be computed as $(k-1)d$, where $k$ is the kernel size and $d$ is the dilation size. The causal receptive field length (CRL) of CCNet can be computed from $k$ and the dilation sizes as $\mathrm{CRL} = 1 + \sum_i (k-1)d_i$. We can set different dilation sizes $d_i$ for different SRBs to adjust the CRL. The kernel and dilation sizes are set to 2 in all SRBs; accordingly, the CRL of CCNet is 41. We also experimented with other CRLs by setting different $d_i$ values; however, we observed that a CRL of 41 is optimal (see Section 4.8). The features produced by all SRBs are summed to form the latent vector $z$ that is passed to the decoder.
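Under the stated settings (kernel size 2, dilation 2, 20 SRBs), the left-padding and receptive-field computations can be checked with a short sketch:

```python
def causal_padding(kernel_size: int, dilation: int) -> int:
    """Zeros padded on the left so each output depends only on past frames."""
    return (kernel_size - 1) * dilation

def receptive_field(kernel_sizes, dilations):
    """Causal receptive field length: one input frame plus the extra history
    each dilated causal convolution layer reaches back."""
    return 1 + sum((k - 1) * d for k, d in zip(kernel_sizes, dilations))

# 20 SRBs with kernel size 2 and dilation 2 in every block.
crl = receptive_field([2] * 20, [2] * 20)
print(crl)  # 41, matching the CRL reported for CCNet
```

Changing the per-block dilations $d_i$ in the list directly yields the alternative CRLs explored in the ablation study.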

Decoder
Decoder $\psi_D$ has a simple "ReLU-Conv1D" structure, where the convolution kernel size is set to 1. It maps the summed features from the SRBs to the PDF of the predicted motion, $(\mu_n, \sigma_n) = \psi_D(z)$, where $\mu_n$ is a vector of channel-wise mean values. The vector $\sigma_n$ is used to compute the final standard deviation values $\hat{\sigma}_n = e^{-\sigma_n}$. This element-wise operation ensures that we always obtain positive standard deviation values for $\hat{\sigma}_n$. Subsequently, the pose at frame $n$ can be obtained by directly using the mean $\mu_n$ (the default setting in our implementation), or it can be sampled from the predicted PDF. Note that the decoder can output $n$ frames at a time during training, owing to the fully convolutional operations.

Training loss
The training loss consists of four terms: a Gaussian loss $L_G$, a motion-smoothness loss $L_s$, a foot-contact-label loss $L_f$, and a direction-control loss $L_d$: $L = L_G + \lambda_1 L_s + \lambda_2 L_f + \lambda_3 L_d$, where the weights $\lambda_1$, $\lambda_2$, and $\lambda_3$ are empirically set to 10.0, 2.0, and 1.0, respectively, in all our experiments.
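A sketch of combining the four terms. The pairing of $\lambda_1$, $\lambda_2$, and $\lambda_3$ with the smoothness, foot-contact, and direction terms, respectively, is an assumption based on the order in which the text lists them:

```python
def total_loss(l_gauss, l_smooth, l_foot, l_dir,
               lam1=10.0, lam2=2.0, lam3=1.0):
    """Weighted sum of the four training terms; the Gaussian term is
    unweighted and the remaining weights follow the listed order
    (an assumption consistent with the text)."""
    return l_gauss + lam1 * l_smooth + lam2 * l_foot + lam3 * l_dir

loss = total_loss(1.0, 0.1, 0.2, 0.3)
```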

Gaussian loss
This term follows the Gaussian mixture loss in the work of Fragkiadaki et al., except that we use only one mode and set the covariance matrix to be diagonal to reduce the number of parameters: $L_G = -\sum_n \log \mathcal{N}\big(x_n;\, \mu_n, \mathrm{diag}(\hat{\sigma}_n^2)\big)$, where $x_n$ is the motion representation extracted from the $n$th frame; the binary foot-contact label in $x_n$ is addressed by $L_f$ and is therefore not included in this term. The Gaussian loss learns to maximize the probability of the motion-representation vector of the ground-truth mocap data during training; therefore, the captured motion data have high probability.
We add a constraint to ensure that the standard deviation $\hat{\sigma}_n$ is greater than a threshold (1e-4) using a clipping operation.
After training, we observed that the standard deviations output by the trained CCNet were typically between 1e-4 and 1e-3, and their mean value was approximately 2.449 times the threshold of 1e-4. Consequently, we can sample a motion according to a Gaussian distribution to enrich the variations in the synthesized motion. The joint positions and linear velocities included in this term can help model the correlations between the rotational degrees of freedom of different joints because such quantities are affected by all the parent joints on the kinematic chain connected to the joints.
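A minimal sketch of a diagonal-covariance Gaussian negative log-likelihood with the 1e-4 clipping threshold mentioned above (an illustrative implementation, not the paper's code):

```python
import math

SIGMA_MIN = 1e-4  # clipping threshold from the text

def gaussian_nll(x, mu, sigma):
    """Per-frame negative log-likelihood under a diagonal Gaussian, with
    standard deviations clipped from below at SIGMA_MIN."""
    nll = 0.0
    for xi, mi, si in zip(x, mu, sigma):
        s = max(si, SIGMA_MIN)
        nll += 0.5 * math.log(2 * math.pi * s * s) + (xi - mi) ** 2 / (2 * s * s)
    return nll

# The loss falls as the predicted mean approaches the ground-truth value.
far = gaussian_nll([1.0], [0.0], [0.5])
near = gaussian_nll([1.0], [0.9], [0.5])
```

Minimizing this quantity over the training data is equivalent to maximizing the probability of the ground-truth motion representations, as the section describes.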

Smoothness loss
This term is a soft constraint that prevents sudden changes in joint velocities and smoothens the synthesized motion: $L_s = \sum_n \big\| (\mu_n - \mu_{n-1}) - (\mu_{n-1} - \mu_{n-2}) \big\|^2$. The smoothness loss is optimized only for the mean of the predicted Gaussian distributions, because the motion generated by the network is typically close to the mean at each frame.
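Assuming the loss penalizes the second difference of the predicted means (one plausible reading of "sudden change in velocities"; the paper's exact form was not recoverable), a single-channel sketch:

```python
def smoothness_loss(means):
    """Penalize frame-to-frame velocity change of the predicted means,
    i.e., a second-difference (acceleration) penalty per channel."""
    loss = 0.0
    for prev, cur, nxt in zip(means, means[1:], means[2:]):
        accel = (nxt - cur) - (cur - prev)
        loss += accel * accel
    return loss / max(1, len(means) - 2)

constant_velocity = smoothness_loss([0.0, 1.0, 2.0, 3.0])  # no penalty
jerky = smoothness_loss([0.0, 1.0, 0.0, 1.0])              # penalized
```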

Foot-contacts loss
We adopt the binary cross-entropy (BCE) loss function to train the network to predict whether the foot is in contact with the supporting plane in the $n$th frame: $L_f = -\sum_n \big[ x_n^f \log \hat{x}_n^f + (1 - x_n^f) \log(1 - \hat{x}_n^f) \big]$, where $x_n^f$ is the ground-truth foot-contact label for the data, and $\hat{x}_n^f$ is the network prediction. Foot-contact labels can be used to trigger IK algorithms to remove foot sliding in the synthesized motions.
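A per-label BCE sketch for the 2D foot-contact vector; the epsilon clamp is an implementation detail added here for numerical safety, not something stated in the paper:

```python
import math

def foot_contact_loss(labels, preds, eps=1e-7):
    """Mean binary cross-entropy over the foot-contact labels (one label
    per foot)."""
    total = 0.0
    for y, p in zip(labels, preds):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

good = foot_contact_loss([1, 0], [0.95, 0.05])  # confident, correct
bad = foot_contact_loss([1, 0], [0.05, 0.95])   # confident, wrong
```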

Direction-control loss
For simplicity, we represent this term as a Gaussian loss and integrate it into $L_G$. It enables CCNet to predict the direction and velocity control signals, and we only use the means of the predicted PDF, $\hat{c}_n^d$ and $\hat{c}_n^v$, when generating motions. Thus, the final Gaussian loss additionally includes terms for the direction and velocity control signals, whose predicted standard deviations $\hat{\sigma}_n^d$ and $\hat{\sigma}_n^v$ are computed in the same manner as $\hat{\sigma}_n$ in Eq. 5 and are discarded after training. This term is helpful in interactive motion control, where control signals are only occasionally input by the user. In this case, the predicted control-signal values are fed into the network to continue motion synthesis.

Dataset
Using the mocap technique, we built a human-motion dataset for 12 different subjects. Three subjects were women, and the rest were men, as shown in Fig. 3. The dataset includes 10 types of motion: walking, running, jumping with the left foot, jumping with the right foot, jumping with both feet, walking backward, zombie walking, kicking, punching, and kicking while punching. All subjects were asked to perform the first seven types of motion, and five were asked to perform the last three types. We recorded approximately 20 min of motion for each subject and asked them to perform two types of motion in one sequence to facilitate the learning of transitions between different motion types. Finally, we obtained 486,282 frames of poses (comprising 27 distinct non-finger joints) from our datasets. The test dataset was formed by all the motion sequences of subject 7, who was randomly selected from the subjects. Additionally, one motion sequence was randomly selected from the sequences of each of the remaining subjects. The test dataset was used to test how our network handles skeleton variations after being trained on multi-subject motion data. It contained 41 motion sequences and 88,649 frames.

Baselines
We primarily focus on human-motion modeling that captures the intrinsic characteristics embodied in the motions of multiple subjects. This is a unique and rarely explored task compared to most existing methods. Few existing methods are similar to the method proposed in this study; hence, we compare our model with the three classic models that are most similar: ERD from the work of Fragkiadaki et al., implemented using four LSTM layers as in [20] and called ERD-4LR; DAE-LSTM [11]; and PFNN, which is an MLP-based network [16]. To test the performance of these three network structures in modeling the motion characteristics of multiple subjects, we added parameters to their first layers to accept skeleton configurations as inputs. Please refer to the supplementary material for the detailed network parameters of the models.

Implementation details
We implemented our algorithm using PyTorch version 1.6. The RMSProp optimizer [13] was employed with an initial learning rate of 1e-4, which decayed to 1e-6 over 2000 epochs. The batch size was set to 256, with each sample containing a motion sequence of 240 consecutive frames. There are two steps to generating the training batches: 1) randomly select a motion clip from the dataset and then a starting frame index in the clip; 2) repeatedly slide the 240-frame window by a one-frame interval within the clip, that is, the starting frame index $f_{s+1}$ of the next 240-frame sample is $f_s + 1$. For an input sequence $X = \{x_0, x_1, \ldots, x_{n-1}\}$, we add independent identically distributed Gaussian noise (with 0 mean and 0.03 standard deviation) to train the network to address accumulated errors in motion synthesis. During training, CCNet can produce the entire output $Y = \{y_1, y_2, \ldots, y_n\}$ at once, owing to the guaranteed ordering in all dilated causal convolutions, which is helpful for speeding up the training procedure.
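The two-step batch-generation procedure and the input-noise augmentation can be sketched as follows; the frame contents are placeholders, and the helper names are hypothetical:

```python
import random

def make_samples(clip, sample_len=240):
    """Slide a fixed-length window over a clip with a one-frame stride:
    the next sample starts at f_s + 1."""
    return [clip[s:s + sample_len]
            for s in range(0, len(clip) - sample_len + 1)]

def add_noise(sample, std=0.03, rng=random.Random(0)):
    """I.i.d. Gaussian noise (mean 0, std 0.03) on the inputs makes the
    network robust to accumulated error at synthesis time."""
    return [[x + rng.gauss(0.0, std) for x in frame] for frame in sample]

clip = [[float(i)] for i in range(300)]  # placeholder 1-channel frames
samples = make_samples(clip)
noisy = add_noise(samples[0])
```

At training time, a batch would draw 256 such windows from randomly selected clips.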
Although the network can generate high-quality motion, slight foot sliding may still occur. Unless otherwise mentioned, the IK algorithm is adopted to remove foot sliding in generated motions according to the predicted foot-contact labels. We refer to the initial frames that are input to CCNet to begin motion generation as seed frames hereafter.

Quantitative and qualitative evaluation on test dataset
We benchmark our CCNet against the baseline models on the motion-denoising error (except for PFNN, because it is primarily designed for controllable motion synthesis) and trajectory-following accuracy. We also present examples to demonstrate the quality of the generated motion. In motion denoising, we use the mean of the predicted PDF as the frame poses. In all other experiments, we sample poses from the predicted PDF.

Motion denoising and completion
The trained CCNet can be directly applied to motion denoising and motion completion. For motion denoising, we randomly select the motion sequence $X$ of a subject and add independent identically distributed Gaussian noise (mean 0, standard deviation 0.01∼0.1) to obtain the noisy motion data $\tilde{X}$. We use the mean of the predicted PDF as the denoised motion $Y$ by feeding $\tilde{X}$ to CCNet. Each frame of the denoised poses is not fed back into the network. Frames with indices less than the CRL were denoised based on all the frames before them. Fig. 4(a) and 4(b) show the denoising results. The standard deviation of the noise in this experiment was set to 0.08. Before denoising, the trajectories of the right hand and right toes fluctuated, and the foot was underneath the ground in some frames. It can be observed that these artifacts are significantly reduced in the denoised motion. We compared CCNet with the baseline networks in terms of the quality of the denoised motions. We added Gaussian noise to the test data with standard deviations of 0.03, 0.05, and 0.1 and then used CCNet, DAE-LSTM, and ERD-4LR, respectively, to denoise the noisy motion data. The error between the ground-truth motion and the denoising result was computed as the Euclidean distance between their motion-representation vectors. We also trained these three models on a selected CMU mocap dataset to further compare their performance on motion denoising (please refer to the supplementary material for details on the selection of the CMU mocap data). As shown in Tab. 1, the error of the denoised motion generated by CCNet was less than those of the motions denoised by DAE-LSTM and ERD-4LR.
The motion-completion procedure is similar to that of motion denoising. The experimental results are shown in Fig. 4(c) and 4(d). We first select a 700-frame motion sequence containing walking and jumping with both feet and then set the rotations of the right-leg joints to 0 in 30% of the frames. CCNet accepts the incomplete motion as an input and outputs a complete, natural-looking motion. Moreover, it can be observed from Fig. 4(b) and Fig. 4(d) that the poses of jumping with both feet vary for different subjects, which means that CCNet can capture the intrinsic styles of different subjects.

Following user-specified trajectories
Synthesizing different types of motions along a specified trajectory is a desirable function in motion planning. We allow users to specify a motion trajectory $J$ on the XOZ plane with additional velocity and motion-type information. We then map the trajectory into the control signals $c_n^d$ and $c_n^t$ that are supported in our system. Please refer to the supplementary material for further details.
As shown in Fig. 5, CCNet can synthesize motion using two user-specified trajectories. Fig. 7(a) shows that the synthesized motions can follow a trajectory with large curvatures and frequently changing motion types. In Fig. 7(b), we show that CCNet can generate various motions, such as the kicking and punching present in our training dataset, when the user specifies these two types along a trajectory. We use the average distance between the user-specified trajectory and the root trajectory on the XOZ plane as the criterion for comparing trajectory-following accuracy. In this experiment, we used six different trajectories that were manually specified by users, extracted the direction control signals, and randomly assigned motion types to the trajectory segments. Subsequently, we synthesized motions using the first 120 frames of the 33 locomotion sequences in the test dataset as the seed frames for each specified trajectory and obtained 198 motion-synthesis results. The trajectory distance is computed by summing the closest distance between the projected root position and the target trajectory in each frame. The means and standard deviations of the averaged trajectory distances are as follows: 27.878 cm ± 8.516 cm for CCNet, 158.67 cm ± 30.94 cm for PFNN, and 171.973 cm ± 31.862 cm for ERD-4LR; an example is shown in Fig. 6. The results of the CCNet model are more accurate than those of the baseline models. We present the six trajectories in the supplementary material.
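A sketch of the trajectory-distance criterion, approximating the closest point on the target trajectory by the closest of its sampled points (an assumption; the paper may use a continuous projection):

```python
import math

def trajectory_distance(root_xz, target_xz):
    """Average, over frames, of the distance from the projected root
    position to the nearest sampled point of the target trajectory."""
    total = 0.0
    for rx, rz in root_xz:
        total += min(math.hypot(rx - tx, rz - tz) for tx, tz in target_xz)
    return total / len(root_xz)

target = [(float(t), 0.0) for t in range(10)]   # straight line along x
on_track = [(0.5, 0.0), (3.2, 0.0)]
off_track = [(0.5, 2.0), (3.2, 2.0)]
```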

Interactive control
CCNet can easily be integrated into interactive applications.We demonstrate this capability by developing a demo that allows the user to control direction, velocity, and motion type through a keyboard.Direction and velocity signals are used to generate future motion trajectories c d n online, similar to PFNN.We used the LibTorch API to ease the implementation of CCNet in C++.
Specifically, the user can control the motion type with the number keys 1 to 5, selecting from five motion types: walking, running, jumping with the left foot, jumping with the right foot, and jumping with both feet. Once a key (for example, 2) is pressed, we update the motion-type label by interpolating the new type label with the previous one over 20 frames, which means that the character can smoothly transition from the previous motion type to the new one. The user can also control the velocity by pressing the up and down keys and the heading direction of the character by pressing the left and right keys. Once the left key is pressed, the trajectory turns to the left. This is achieved by first computing a small offset vector $o_n = [1, 0] * h * 0.015$, where $h$ is the height of the root. This offset is added to $c_n^d$ as $o_n * w_i$, where $w_i = i/5$, $i = 0, \ldots, 5$. Thus, the offset is added to the six points in the predicted control signal $\hat{c}_n^d$ through the corresponding $w_i$. The distance between the 2D points in the updated $\hat{c}_n^d$ is then adjusted according to the user-specified velocity $v_u$. Because $\hat{c}_n^d$ represents the future motion trajectory within one second, we can adjust the velocity by multiplying the distance between the 2D points by the ratio $v_u / v_{cur}$. The current scalar velocity of the character, $v_{cur}$, is computed from the lengths between the 2D points in $c_n^d$. The velocity is changed from the current velocity to the user-specified velocity within a 20-frame interval. Fig. 8 illustrates the user interface used in interactive control, and further results can be observed between 1 m 32 s and 1 m 51 s in the accompanying video.
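The turning and velocity-adjustment steps can be sketched as follows; the blending weights $w_i = i/5$ over the six trajectory points and the per-segment rescaling follow the description above, with unstated details filled in as assumptions:

```python
def turn_left(traj_points, root_height):
    """Blend a small lateral offset o = [1, 0] * h * 0.015 into the six
    predicted 2D trajectory points; later points receive a larger share
    via w_i = i / 5."""
    ox, oz = 1.0 * root_height * 0.015, 0.0
    return [(x + ox * (i / 5.0), z + oz * (i / 5.0))
            for i, (x, z) in enumerate(traj_points)]

def rescale_speed(traj_points, v_user, v_cur):
    """Scale the spacing between consecutive trajectory points by
    v_user / v_cur to match the user-specified velocity."""
    ratio = v_user / v_cur
    out = [traj_points[0]]
    for (x0, z0), (x1, z1) in zip(traj_points, traj_points[1:]):
        px, pz = out[-1]
        out.append((px + (x1 - x0) * ratio, pz + (z1 - z0) * ratio))
    return out
```

The first trajectory point (the character's current position) is left unchanged by both operations, which keeps the transition continuous.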

User Study
To measure the visual quality of the generated motions, we followed the advice of a researcher in the field of human interaction to conduct a two-alternative forced-choice user study. We selected 16 participants (six women and ten men) with experience in 3D animation or games because they are able to judge motion quality. We then gave the participants five clear and detailed criteria, which are described in the supplementary material, and showed them some examples for each criterion before the user study. The procedure of the user study is as follows. First, we presented the 16 participants with all the groups of motion sequences: four groups for CCNet and the baseline models. Each group contained 16 pairs of motion sequences. In each pair, one is the mocap sequence, and the other is generated by CCNet or one of the baseline models. Second, we asked the participants to answer the question, "Which motion sequence in the pair is of better motion quality?" according to the five criteria.
After obtaining the user-study results, we checked them. First, we checked the time that each participant spent on completing the questionnaire. If the time was less than 10 min (the shortest time needed to judge all motion sequences), we discarded the questionnaire. Second, if a questionnaire had blank responses, it was discarded. Third, if a questionnaire had conflicting choices, it was also discarded. For example, if a participant chose A, B, and C as the better sequence from three pairs, (A, B), (B, C), and (A, C), we treated these as conflicting choices. From the first two choices, we can infer that A is better than C; however, the participant chose C as the better sequence from the third pair. Finally, we obtained 15 valid questionnaires for DAE-LSTM and 16 for the other models. We performed a t-test on the user-study results to verify the hypothesis that CCNet can generate motions of better quality than the baseline models, and the results are shown in Tab. 2. The P values of CCNet versus the other baseline models were all less than the selected threshold (0.05). Therefore, the motions generated by CCNet were significantly different from those generated by the baselines. Based on the average number of motion sequences selected by the participants (mean in Tab. 2), the number of choices for CCNet is larger than those for the other baselines, which verifies that CCNet can better capture the intrinsic characteristics of the motions of different subjects. Furthermore, we prepared another three-group dataset that contained pairs of motion sequences generated by CCNet and each baseline model. As listed in Tab. 3, the number of CCNet-generated motion sequences selected by the participants was still higher than that of the sequences generated by the baseline models.

Table 2. T-test of user-study results (confidence interval = 0.95). VS: performing a t-test between the results of CCNet and those of all baseline models.
We also performed an ANOVA test on the user-study results, as illustrated in Tab. 4, which verifies the statistical significance of the user study.

Generalization to unseen skeletons
After training CCNet with multi-subject motion data, the model can generate motions for skeletons that are not in the training dataset. As illustrated in Fig. 3, 4(a), 4(b), and 7(a), we applied the trained CCNet to automatically generate motions for the skeleton of subject 7, which was unseen during training. It can be observed in Fig. 9 that the baseline models cannot differentiate the variations exhibited by different skeletons as effectively as CCNet can. We further tested the generalization ability of CCNet by applying it to a specially designed skeleton that was generated by scaling the skeleton of subject 7. The topology of this skeleton remained the same as that of the other skeletons but varied substantially in the lower-body scale. Because there are no mocap data for this skeleton, we utilized a motion-retargeting algorithm [4] to generate 120 seed frames for it. Fig. 10 illustrates that CCNet can effectively generalize to the new skeleton. In addition, we used ERD-4LR and PFNN to generate motions for the skeleton. The results show that both motions contain large, sharp changes between the seed frames and the generated frames, which is inferior to the motions generated by CCNet. Please refer to the accompanying video from 4 m 16 s to 4 m 21 s for the relevant results.
Given a part of the motion data of a novel skeleton, CCNet can learn to generate motions for the skeleton that are similar to its ground-truth mocap data. Tab. 5 shows that, after fine-tuning the network using the walking and running motions of subject 7, the relative pose difference $rel_p$ for all mocap data of this subject in the test dataset can be significantly reduced (refer to the accompanying video from 2 m 31 s to 2 m 47 s for a comparison of the generated jumping motions of subject 7 before and after fine-tuning). This implies that CCNet can capture the intrinsic characteristics embodied in the motions of the new subject better than the other models. We compute the relative pose difference as the per-frame difference between $\hat{x}_n$ and $x_n$, averaged over all $N$ frames, where $\hat{x}_n$ and $x_n$ are the motion-representation vectors of the motions generated by CCNet and the corresponding mocap data, respectively. The ability to generalize to new skeletons is crucial because it can reduce the effort of capturing a large amount of mocap data for a new skeleton in motion-synthesis applications.
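A sketch of the relative pose difference; normalizing each frame's error by the ground-truth norm is an assumption, since the paper's exact formula is not given here:

```python
import math

def rel_pose_difference(generated, mocap):
    """Average per-frame distance between generated and ground-truth
    motion-representation vectors, normalized by the ground-truth norm
    (the normalization is an assumption)."""
    total = 0.0
    for g, m in zip(generated, mocap):
        diff = math.sqrt(sum((gi - mi) ** 2 for gi, mi in zip(g, m)))
        norm = math.sqrt(sum(mi ** 2 for mi in m))
        total += diff / norm
    return total / len(generated)

perfect = rel_pose_difference([[1.0, 0.0]], [[1.0, 0.0]])  # identical motions
```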
To evaluate how the number of subjects in the dataset influences the generalization ability of CCNet, we intentionally put the motions of subjects 1, 3, 4, and 8 into dataset 1 and the motions of subjects 0, 5, 6, and 11 into dataset 2. Tab. 5 lists the rel_p values of the motions generated by CCNet trained on dataset 1 (CCNet-D1) and dataset 2 (CCNet-D2). Because the heights of subjects 1, 3, 4, and 8 are closer to subject 7's height, the rel_p value of CCNet-D1 is smaller than that of CCNet-D2; however, it is still larger than that of CCNet trained on the entire training dataset. Thus, to improve the generalization ability of CCNet for new skeletons, it is better to construct a dataset with more subjects so that the network learns how to handle their skeleton and style variations.

Motion prediction on H3.6M dataset
We trained CCNet directly on the H3.6M [18] dataset, without any additional modifications, to test its motion-prediction ability. We followed the same data representation and report the mean angle error (MAE) on the same test dataset as Fragkiadaki et al. and Liu et al. We compared CCNet with ERD-4LR, HP-GAN [2], QuaterNet [28], and AM-GAN [23]; the results are reported in Tab. 6. Because Liu et al. have not yet released their code, the MAEs of AM-GAN in Tab. 6 are sourced from their paper. CCNet is primarily designed for long-term multi-subject motion generation (typically at least 10 s) rather than motion prediction; nevertheless, it achieves average performance among these methods. We believe that, with meticulous adjustments, the performance of CCNet on motion prediction can be improved.

Evaluation of network hyper-parameters and training settings
In this section, we present ablation-study experiments to justify the following hyper-parameters and training settings selected for CCNet: 1) the causal receptive field length (CRL), 2) the number of consecutive frames of each sample in a batch (N_CF), 3) the joint rotations and angular velocities in the data representation (ROT), 4) the smoothness loss term (Smooth), 5) the skeleton configurations (SK), 6) the advantages of SRBs over LSTM (SRB2LSTM), and 7) the seed-frame length when synthesizing. We chose the values and settings that minimize the loss L in Eq. 6 on both the training and test datasets. For better visualization, we plot the logarithmic loss curves using the formula log10(L + 320) in Fig. 11 and 12.
Because L is typically approximately −300, a bias of 320 is necessary to keep the argument of the logarithm positive.
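The loss transform used for plotting Fig. 11 and 12 is a one-liner; the sketch below simply restates log10(L + 320) from the text.

```python
import numpy as np

def log_loss(raw_loss):
    """Transform used to plot the loss curves in Fig. 11 and 12.

    Raw losses near -300 are shifted by +320 so the logarithm is
    defined, then compressed with log10 for visualization.
    """
    return np.log10(raw_loss + 320.0)
```

For example, a raw loss of −310 maps to log10(10) = 1, and a raw loss of −300 maps to log10(20) ≈ 1.3.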

CRL and N_CF
We conducted experiments to choose the CRL value among three settings: 31 (the dilation sizes of the SRBs alternate between 1 and 2), 41 (the dilation sizes of all SRBs are 2), and 46 (the dilation sizes of the SRBs repeat the pattern 1, 2, and 4). The number of SRBs is fixed at 20. We also tested three values of N_CF, 60, 120, and 240, which correspond to 1 s, 2 s, and 4 s of motion for our 60-fps motion dataset, respectively. Note that we keep N_CF fixed at 240 when computing the loss on the test dataset for fair comparisons. Based on Fig. 11, we chose CRL = 41 and N_CF = 240 for CCNet because these settings led to the lowest loss.
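The three CRL values follow from standard receptive-field arithmetic for stacked dilated causal convolutions: with kernel size k, each block with dilation d extends the receptive field by d·(k − 1). A kernel size of 2 per SRB is an assumption here, chosen because it reproduces the three CRLs reported above for 20 blocks.

```python
def causal_receptive_length(dilations, kernel_size=2):
    """Receptive field length of a stack of dilated causal convolutions.

    CRL = 1 + sum(d * (kernel_size - 1)) over all blocks; kernel_size=2
    is an assumption consistent with the CRLs reported in the text.
    """
    return 1 + sum(d * (kernel_size - 1) for d in dilations)

n_blocks = 20
pattern_12 = [[1, 2][i % 2] for i in range(n_blocks)]     # dilations 1, 2, 1, 2, ...
pattern_2 = [2] * n_blocks                                # all dilations 2
pattern_124 = [[1, 2, 4][i % 3] for i in range(n_blocks)] # dilations 1, 2, 4, 1, ...

print(causal_receptive_length(pattern_12))   # 31
print(causal_receptive_length(pattern_2))    # 41
print(causal_receptive_length(pattern_124))  # 46
```

This arithmetic makes the trade-off explicit: with the block count fixed at 20, the dilation pattern alone sets how far back in time the network can see.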

ROT, Smooth and SK
To verify their importance, we removed the joint rotations and angular velocities from the input and removed the smoothness loss term from Eq. 6. Additionally, the skeleton configurations were removed by disconnecting their corresponding 1D convolution modules from CCNet. We can observe from Fig. 11 and 12 that without the joint rotations and angular velocities, the smoothness loss term, or the skeleton configurations, the corresponding networks overfit the training set. Therefore, we conclude that encoding joint rotations and angular velocities in the data representation and including the smoothness loss term are essential for CCNet to converge to a better result. Additionally, the skeleton configuration is important for disambiguating the motions of different subjects.

LSTM vs. SRB
We determined the advantages of the SRBs over LSTM by replacing the SRBs with LSTM layers with different numbers of layers and hidden-state channels. The results show that only the 1-layer and 2-layer LSTM settings with 512 hidden channels, trained with a smaller initial learning rate of 1e-5, can converge, and they overfit the training set, as depicted in Fig. 12.

Seed frame length
The influence of the seed-frame length on the quality of the generated motions was measured using rel_p. Specifically, we extracted seed frames from the mocap data in the test dataset and then used the networks to predict a frame for comparison with the corresponding mocap frame. Tab. 7 shows the rel_p values for different seed-frame lengths. A lower value indicates that the generated motion is more similar to the mocap data and is thus of better quality. It can be observed that CCNet is robust to variations in the seed-frame length compared to ERD-4LR and DAE-LSTM, and it does not require excessively long seed frames to synthesize high-quality motions. However, we observe that jitters between the seed frames and the generated frames are slightly more obvious when the numbers of seed frames are one and five (please refer to the accompanying video from 3 m 23 s to 3 m 46 s for details). We hypothesize that such short seed frames do not provide CCNet with sufficient information to generate smooth motions.

Conclusion
We designed a novel neural network for motion generation, CCNet, to synthesize high-quality motions for multiple subjects. The trained CCNet captures the motion characteristics of different subjects well and synthesizes various types of motion, such as punching. Moreover, CCNet can generate motions for novel skeletons. Given a few sample motions of a novel skeleton, the pretrained CCNet can be fine-tuned to synthesize motions that better reproduce the intrinsic characteristics of the motions of that skeleton. In the future, we plan to extend our method to skeletons with different topologies.

Figure 1 .
Figure 1. The proposed CCNet can scale up to a large-scale motion dataset across multiple subjects. Left: Examples of motion-capture data and control signals. Middle: A schematic of our network. It consists of a motion-feature embedding encoder, a series of separate residual blocks (light-green blocks) used to capture temporal correlations, and a decoder that maps the latent features to the probability distribution of predicted motions. We omit the skip connections here for simplicity. Right: Examples of motions sampled from our network outputs.

Figure 2 .
Figure 2. The detailed architecture of the separate residual block. Each type of control signal is input to its own Conv1D layer, and the kernel size of the Conv1D is 1. The numbers beside O_r^{i−1}, O_r^i, and O_s^i indicate their numbers of channels.

4.1 Dataset and baselines
4.1.1 Dataset
(a) Skeletons of different subjects. (b) Meshes of different subjects.

Figure 4 .
Figure 4. (a) and (b): A motion-denoising result for subject 7 (from 0 m 9 s to 0 m 20 s in the video). (c) and (d): A motion-completion result for subject 5 (from 0 m 24 s to 0 m 35 s in the video).

Figure 5 .
Figure 5. Trajectory-following results of two subjects (from 0 m 49 s to 1 m 7 s in the video). Left: A synthesized motion transitioning from walking to running and then to zombie walking for subject 10. Right: A synthesized motion transitioning from jumping with the right foot to jumping with the left foot and then to jumping with both feet for subject 11. We use different colors to represent different motion types (refer to the video for details).

Figure 6. Figure 7.
Figure 6. Comparisons against ERD-4LR and PFNN (from 4 m 24 s to 4 m 46 s in the accompanying video). The character starts by jumping with the left foot and then changes to jumping with the right foot until the end. The total errors (over 3,000 frames) between the synthesized trajectories (yellow lines) and the input trajectories (green lines) of ERD-4LR, PFNN, and CCNet are 177.143 cm, 156.604 cm, and 27.043 cm, respectively. IK is disabled in this experiment.

Figure 8 .
Figure 8. The user interface for interactive control. The green dots on the ground represent the direction-control signal. IK is disabled in this experiment.

Table 3 .
Results of the user study. Baseline vs. CCNet: a group of 16 pairs of motion sequences generated by a baseline model and CCNet. Mean±std: the mean and standard deviation of the number of CCNet-generated motion sequences selected by all participants in each group.
Groups | Numbers for CCNet (mean±std)
DAE-LSTM vs. CCNet | 12.31±2.34
ERD-4LR vs. CCNet | 11.63±2.87
PFNN vs. CCNet | 11.45±1.87

Figure 9 .
Figure 9. Foot-ground penetrations in the motions generated by DAE-LSTM, ERD-4LR, and PFNN. Left: the 611th frame generated by DAE-LSTM for subject 9. Middle: the 450th frame generated by ERD-4LR for subject 7. Right: the 666th frame generated by PFNN for subject 8. DAE-LSTM, ERD-4LR, and PFNN cannot effectively differentiate the variations in different skeletons, which leads to foot-ground penetrations, as indicated by the red rectangles.

Figure 10 .
Figure 10. Trajectory-following results generated by CCNet for an unseen skeleton, subject 7b (from 2 m 3 s to 2 m 15 s in the accompanying video). The skeleton of subject 7b is generated by scaling the lower body of subject 7's skeleton by 0.8.

Figure 11 .
Figure 11. Logarithmic loss curves obtained using different hyper-parameters of CCNet. Left: training. Right: test. We modify each hyper-parameter, including CRL, N_CF, with/without the skeleton configuration (SK or w/o_SK), and with/without the joint rotations and angular velocities in the data representation (ROT or w/o_ROT).

Figure 12 .
Figure 12. Ablation study of the smoothness loss term and the SRBs. We remove the smoothness loss term, replace the SRBs with a 1-layer LSTM and a 2-layer LSTM, and evaluate the logarithmic losses of the corresponding re-trained models. Left: training. Right: test.