CTSN: Predicting Cloth Deformation for Skeleton-based Characters with a Two-stream Skinning Network

We present a novel learning method to predict the cloth deformation for skeleton-based characters with a two-stream network. The characters processed in our approach are not limited to humans, and can be other skeleton-based representations of non-human targets such as fish or pets. We use a novel network architecture which consists of skeleton-based and mesh-based residual networks to learn the coarse and wrinkle features as the overall residual from the template cloth mesh. Our network is used to predict the deformation for loose or tight-fitting clothing or dresses. We keep the memory footprint of our network low, which reduces storage and computational requirements. In practice, our prediction for a single cloth mesh for the skeleton-based character takes about 7 milliseconds on an NVIDIA GeForce RTX 3090 GPU. Compared with prior methods, our network can generate fine deformation results with details and wrinkles.


Introduction
Cloth animation is an important problem in computer graphics due to its wide range of applications, including video games, special effects, and virtual try-on. It is regarded as a challenging task due to the model complexity of the cloth and the irregular deformations that cloth can undergo. Furthermore, many applications require interactive performance on commodity hardware, including mobile devices. This problem has been extensively studied in the literature. In order to achieve high-quality and reliable results, many efficient techniques based on physics-based simulation (PBS) have been proposed [1,22,4,5,9,29,31]. In these methods, the underlying cloth is modeled as a 3D surface mesh subdivided into finite contiguous triangles, and collision handling methods are used to generate accurate simulations. However, these methods cannot provide real-time frame rates for interactive applications.
There has been considerable work on using machine learning methods to significantly reduce the computational cost of predicting cloth deformation. Many learning-based networks [20,3,23] have been proposed for SMPL-based parametric 3D human models [16]. These SMPL-based methods are used to generate smooth deformations for humans moving with tight-fitting clothes. The prediction is generated in real time because of the small number of parameters used in SMPL-based networks. However, the SMPL-based model is limited and cannot be used on arbitrary objects or characters used in games. In order to handle more general characters and enhance the quality of prediction, other algorithms use multi-layer perceptron (MLP) models on the vertices of the cloth mesh to learn the deformation [33]. Because they do not use the topology of the cloth mesh, such MLP-based methods tend to train a network with a large number of parameters, which increases the memory overhead and the runtime cost. Recently, Graph Convolutional Networks (GCN) have been used to predict the cloth draping results on human characters [8,7,32]. In practice, these methods need the pre-deformed cloth for the target pose [8,7] or can only process the draping results on human characters in a T-pose [32].
In this paper, we deal with skeleton-based characters, which are widely used in computer games and other interactive applications. These include human-like characters (such as leading roles), monster characters similar to humans (such as trolls), and animal characters (such as pets). All these different characters can wear different types of clothes. We propose a learning-based cloth skinning model to capture the coarse and wrinkle features to obtain the final cloth deformation. Our approach is general and designed for all types of skeleton-based characters, including humans and animals. Furthermore, these characters can be dressed with loose or tight-fitting clothes.
Our formulation models the cloth draping deformation as the skinning of the cloth template at a canonical pose (such as a T-pose or an A-pose). Given the skeleton information and mesh information of the posed character, the deformation of the cloth is computed from the skinning weights and the template cloth mesh. In order to handle different skinned characters and cloth, we design a novel two-stream network architecture to learn the residual positions of the vertices of the cloth template mesh. It consists of a mesh-based residual stream and a skeleton-based residual stream. The skeleton-based residual stream is trained to obtain the coarse residual on the cloth template mesh, while the mesh-based residual stream is trained for the wrinkle features. Prediction examples of our two-stream skinning network are shown in Fig. 1.
We qualitatively and quantitatively analyze the performance of the proposed two-stream skinning network in a variety of scenarios. These include human-like characters and other characters. We validate our two-stream network through ablation experiments. Compared with recent methods, our two-stream network can capture the fine details of the cloth deformation.
The novel components of our work include:
• A learning-based cloth skinning model: Our approach models the cloth deformation as the learning-based skinning of the template cloth mesh. Our skinning model is not limited to humans and can process many skinned characters.
• Two-stream skinning network architecture for cloth deformation prediction: Based on the learning-based cloth skinning model, we design a novel two-stream network architecture for cloth deformation prediction. The architecture consists of a mesh-based residual stream, which is trained for wrinkle features, and a skeleton-based residual stream, which is trained for coarse features.
• Ability to process different types of clothes and characters: Our network can process various types of characters and clothes. These characters and clothes can vary considerably.
We show the prediction results of our proposed skinning-based network on different human and non-human characters with different cloth types in Section 5. We compare our method qualitatively and quantitatively with other methods in Section 6. We can predict a deformed cloth mesh in about 7 milliseconds on average on an NVIDIA GeForce RTX 3090 GPU. Compared with prior approaches, our method can predict deformation results with fine wrinkles and details.

Related Work
In this section, we give an overview of cloth deformation prediction using traditional PBS methods and recent learning-based methods. Many learning-based methods are limited to the SMPL model; we describe these methods in Section 2.2 and highlight other learning methods in Section 2.3.

Physics-based Simulation
PBS methods for generating deformed cloth are commonly based on the pipeline of time integration [1], collision detection [5,29], and collision response [5,9,31]. While they can accurately model the deformation and result in non-penetrating simulations, the running time is not fast enough for interactive applications. To accelerate the simulation, recent research tends to use GPU-based algorithms to parallelize the pipeline [30,13]. However, current methods still take hundreds of milliseconds to simulate each frame on high-end desktop GPUs. Moreover, the performance of these simulators depends on various parameters, such as material attributes, which are hard to fine-tune.
Many learning methods have been proposed based on SMPL-based parametric 3D human models. [16] proposed parametric skinning human models using SMPL, where the deformation of the human body mesh is driven by the skinning skeleton of the template body mesh. [20] regard the cloth mesh as a sub-mesh of the SMPL body mesh, and use an indicator matrix to select the associated vertices on the body mesh as the initial state. The proposed network, TailorNet [20], is trained as an increment from the initial state to represent the template cloth mesh. This is used to perform skinning operations to obtain the final deformation on the target pose. [14] use the skinned body mesh directly on the target pose as the initial state and learn a graph-attention-based network to predict the residual between the initial state and the final deformed cloth mesh with wrinkles. These methods use the vertices on the unposed template body mesh or the posed target body mesh as the initial state of the deformed cloth mesh and train different networks to fit the residuals of the ground truth. Therefore, these methods may not generate plausible results on some loose-fitting clothes such as dresses, because the cloth vertices may be far away from the body mesh.
Other algorithms have been proposed that treat the cloth mesh deformation as a skinning deformation similar to the body mesh skinning [11,16]. These methods tend to build a skinning model for cloth deformation from the canonical template cloth mesh. [23] use a garment fit regressor and a garment wrinkle regressor to learn the nonlinear residuals of the ground truth from the canonical cloth mesh. To enhance the performance on loose-fitting clothes, [25] smoothly diffuse the skinning parameters of neighbors for each vertex on the unposed cloth mesh. They propose an optimization-based strategy to project ground-truth garments to the canonical space without introducing collisions. However, the diffusion of the skinning parameters is only operated on the unposed canonical cloth, which limits the improvement of the predictions on loose-fitting clothes. [3] use GCN to extract features on the unposed canonical cloth mesh to learn the blend weights. These methods ignore the impact of the poses on the skinning weight parameters. In practice, all these networks are constrained by the pose and shape parameters of SMPL.

Learning-based Cloth Deformation
Many learning-based methods have been proposed for general cloth meshes and characters that are not limited to SMPL-based representations. [8,7] use dual quaternion skinning (DQS) [11] to generate the pre-deformation of the cloth template from the canonical pose and use GCN blocks to learn the residuals from the pre-deformation to the ground truth cloth mesh. [10] use PCA to obtain the subspace of the cloth and the obstacle and use an MLP to regress the non-linearity in subspace deformation. Unfortunately, using the previous predictions as the input of the subsequent predictions accumulates error and degrades the quality of the result. [33] only use the vertex coordinates of the cloth mesh to learn a cloth descriptor that can be fused with motion in latent space. Considering the difficulty of predicting the cloth deformation caused by body pose, [32] use an encoder and decoder architecture with GCN to learn the draping effect of different cloth types on the canonical pose. Other methods are designed for general triangle mesh-based obstacles [10,15].
Many techniques have been proposed to estimate a collision-free subspace of general 3D deformable models and used to compute collision-free cloth configurations [27,28]. For human-like characters, many learning methods [8,2] use a collision loss to penalize penetrating garment-body pairs during training. Our approach for handling arbitrary characters and clothing types is complementary and can be combined with these methods.

CTSN: Our Approach
Our approach takes a skeleton-based character at the target pose and a cloth template at the canonical pose as input and predicts the cloth mesh deformation for the target pose character through a skinning-based network. The skeleton-based character at the target pose has the skinned mesh and the transformation information of the joints. The key concept of our approach is a novel skinning-based cloth model. We propose a network architecture composed of two residual networks based on the cloth model. We present the details of our skinning-based cloth model and the network architecture in the following sections.

Skinning-based Character Model
Our skinning-based cloth model is inspired by the skinning-based character model, SMPL [16]. We give a brief overview of the SMPL model and the symbols used in the rest of the paper.
In standard skeletal rigging, the posed character is calculated by the following formula:
$$M_B(\gamma) = W(T_B, J, \gamma, W_B),$$
where $M_B(\gamma)$ is the posed character mesh; $T_B$ is the template character mesh at the canonical pose; $J$ is the skeleton of the character; $\gamma$ is the transformation matrix of the character joints; $W_B$ is the skinning weight matrix; and $W(\cdot)$ is the skinning function. The parametric skinning human model SMPL [16] uses a set of orthonormal principal components of shape and pose displacements to capture the soft-tissue dynamics. This model is represented as:
$$M_B(\beta, \theta) = W(T_B(\beta, \theta), J(\beta), \theta, W_B),$$
$$T_B(\beta, \theta) = T_B + B_S(\beta) + B_P(\theta),$$
where $\beta$ and $\theta$ are the shape coefficients and the pose vector, which contains the transformation information of the joints, respectively. $J(\beta)$ is the skeleton position with shape coefficients $\beta$. $T_B(\beta, \theta)$ is the template human mesh, which is a function of $\beta$ and $\theta$. To capture the soft-tissue dynamics, the body shape blend offsets $B_S(\beta)$ and the pose blend shapes $B_P(\theta)$ are added to the initial template human body mesh $T_B$ to generate the final template human mesh $T_B(\beta, \theta)$.
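To make the skinning function $W(\cdot)$ concrete, the following is a minimal sketch of linear blend skinning in plain Python. It assumes that each joint transform is given as a 3x4 matrix [R | t] already expressed relative to the rest pose (that is, already composed with the inverse bind matrix); the function name `lbs` and the toy data are illustrative and not taken from the paper.

```python
def lbs(template, weights, transforms):
    """Linear blend skinning of a rest-pose mesh.

    template:   n rest-pose vertices, each (x, y, z)
    weights:    n rows of k skinning weights (each row sums to 1)
    transforms: k joint transforms, each a 3x4 matrix [R | t] (assumed
                to be relative to the rest pose)
    """
    posed = []
    for (x, y, z), row in zip(template, weights):
        out = [0.0, 0.0, 0.0]
        for w, T in zip(row, transforms):
            if w == 0.0:
                continue  # joint does not influence this vertex
            for axis in range(3):
                r = T[axis]
                out[axis] += w * (r[0] * x + r[1] * y + r[2] * z + r[3])
        posed.append(tuple(out))
    return posed


# Two joints: one stays in place, one translates by 2 along x.
identity = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
shift_x = [[1.0, 0.0, 0.0, 2.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

template = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
weights = [[1.0, 0.0], [0.5, 0.5]]  # second vertex is shared by both joints
posed = lbs(template, weights, [identity, shift_x])
```

The second vertex is influenced equally by both joints, so it moves by half of the translation of the second joint.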

Our Two-stream Skinning-based Cloth Model
Cloth deformation is driven by the character motion since the cloth is dressed on the surface of a character mesh. To simplify the deformation problem, we use a skinning-based model for the template cloth mesh to guide the deformation.
Inspired by the SMPL model and other approaches [23], we present a new method to build a skinning-based model for cloth deformation. Given a template cloth mesh $T_C$ at the canonical pose and the skeleton transformation matrix $\gamma$ at the target pose, the deformed cloth $M_C(\gamma)$ is defined as follows:
$$M_C(\gamma) = W(T_C(\gamma), J, \gamma, W_C) = \mathrm{LBS}(T_C(\gamma), J, \gamma, W_C),$$
$$T_C(\gamma) = T_C + \Delta_S(\gamma) + \Delta_M(\gamma),$$
where $\gamma$ is the transformation matrix of the joints of the target character body. $W_C$ is the skinning weight matrix for the cloth template mesh $T_C$. For the skinning function $W(\cdot)$, $\mathrm{LBS}(\cdot)$ represents the linear blend skinning (LBS) method [17], which is widely supported by game engines. $T_C(\gamma)$ is the optimized template cloth mesh at the canonical pose. $\Delta_S(\gamma)$ is the skeleton-based residual positions trained to obtain the coarse features. $\Delta_M(\gamma)$ is the mesh-based residual positions trained for adding wrinkle details to the coarse prediction. We highlight our two-stream network architecture in Fig. 2.
Our network architecture consists of a mesh-based residual stream and a skeleton-based residual stream. The mesh-based residual stream is designed to compute the impact on the cloth template mesh of the vertices of the posed character mesh that are nearest to the cloth, i.e. $\Delta_M(\gamma)$, while the skeleton-based residual stream is used to model the influence of the skeleton information of the character on the cloth template mesh, i.e. $\Delta_S(\gamma)$. Since the cloth type can be tight or loose, we train the skinning weight matrix $W_C$ for different types of cloth. We present more details in Sections 3.2, 3.3, and 3.4. In general, our network architecture can be expressed as a skinning-based network $N_\sigma$, where $\sigma$ represents the trainable parameters. Similar to TailorNet [20], we decompose the deformed cloth mesh into low-frequency and high-frequency deformations. To obtain the low-frequency deformation of the cloth mesh, we apply Laplacian smoothing to the simulated cloth mesh. The high-frequency deformation is the residual wrinkle detail.
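The low/high-frequency decomposition can be sketched as follows. This is a minimal illustration in plain Python and not the exact procedure used for the dataset: the uniform (umbrella) Laplacian, the step size `lam`, and the number of iterations are assumptions, since the text only states that Laplacian smoothing is applied to the simulated cloth mesh.

```python
def laplacian_smooth(vertices, neighbors, iterations=1, lam=0.5):
    """Move every vertex a fraction `lam` toward the mean of its neighbors."""
    current = [tuple(v) for v in vertices]
    for _ in range(iterations):
        updated = []
        for i, v in enumerate(current):
            nbrs = neighbors[i]
            if not nbrs:
                updated.append(v)  # isolated vertex stays where it is
                continue
            mean = [sum(current[j][a] for j in nbrs) / len(nbrs) for a in range(3)]
            updated.append(tuple(v[a] + lam * (mean[a] - v[a]) for a in range(3)))
        current = updated
    return current


def split_frequencies(vertices, neighbors, iterations=1, lam=0.5):
    """Return (low, high): smoothed mesh and the per-vertex wrinkle residual."""
    low = laplacian_smooth(vertices, neighbors, iterations, lam)
    high = [tuple(v[a] - l[a] for a in range(3)) for v, l in zip(vertices, low)]
    return low, high


# A three-vertex strip with a bump in the middle.
vertices = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 0.0, 0.0)]
neighbors = [[1], [0, 2], [1]]
low, high = split_frequencies(vertices, neighbors)
```

By construction, adding the high-frequency residual back to the smoothed mesh recovers the simulated mesh, so the two training targets together carry the full ground truth.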

Skeleton-based Residual Stream
In our skeleton-based residual stream, the input is the transformation matrix $\gamma$ of the character joints at the target pose. We pass the transformation matrix $\gamma$ into the pose embedding network, which is composed of an MLP, to learn the pose embedding $P = \{P_1, P_2, P_3, \cdots, P_m\}$, where $m$ is the size of the embedding vector $P$:
$$P = \Phi(\gamma),$$
where $\Phi(\cdot)$ is the MLP-based pose embedding network.
After the pose embedding, our goal is to learn a set of character residual matrices for the character and cloth pair. Each matrix $B_j$, where $j \in \{1, 2, \cdots, m\}$, can be expressed as:
$$B_j = \begin{bmatrix} b_{00} & b_{01} & b_{02} \\ \vdots & \vdots & \vdots \\ b_{n0} & b_{n1} & b_{n2} \end{bmatrix}.$$
To train the skeleton-based residual stream to obtain the coarse features, we use the obtained low-frequency deformation as the ground truth.
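One way to read the skeleton-based residual stream is as a pose-weighted blend of the $m$ trainable per-vertex residual matrices. The sketch below assumes the fusion is the weighted sum $\Delta_S(\gamma) = \sum_j P_j B_j$; this reading, the function name, and the toy values are assumptions for illustration, and the pose embedding is given directly instead of being produced by an MLP.

```python
def skeleton_residual(pose_embedding, residual_matrices):
    """Blend m per-vertex residual matrices (each n x 3) with m pose weights."""
    n = len(residual_matrices[0])
    delta = [[0.0, 0.0, 0.0] for _ in range(n)]
    for p_j, b_j in zip(pose_embedding, residual_matrices):
        for i, row in enumerate(b_j):
            for a in range(3):
                delta[i][a] += p_j * row[a]
    return delta


# m = 2 embedding entries, n = 2 cloth vertices (toy values).
P = [0.5, 2.0]
B = [
    [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]],  # B_1
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],  # B_2
]
delta_s = skeleton_residual(P, B)
```

Under this reading, the matrices act like pose-independent blend shapes for the cloth template, and only the $m$ blending weights change with the pose.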

Mesh-based Residual Stream
The skeleton-based residual stream can only predict the position offset $\Delta_S(\gamma)$, which captures the coarse features of the target deformation. The prediction results of the skeleton-based residual stream are smooth. To improve the prediction, we use a mesh-based residual stream to learn the wrinkle residual for the final cloth deformation.
We build a KD-tree for the template cloth mesh and the body mesh at the canonical pose. We use this tree data structure to find the nearest point index $I_C$ on the body mesh for each vertex on the cloth mesh. Given the input transformation matrix of the skeleton of the body, we can obtain the skinned body mesh at the target pose by using our skinning method. We obtain the positions $V$ of the nearest points through the selected index $I_C$. In order to improve the effectiveness of our mesh-based residual stream, we also build the reference mesh graph $M_V = (V, E, A)$, where $V$ corresponds to the nearest vertices computed previously, which serve as the nodes of the graph $M_V$; $E \subseteq V \times V$ corresponds to the edges of the template cloth mesh; and $A$ is the $(0, 1)$ adjacency matrix that encodes the connectivity of the vertices $V$.
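The construction of the reference mesh graph can be sketched as follows. For brevity, the sketch replaces the KD-tree with a brute-force nearest-neighbor search, which returns the same indices but scales quadratically; the function names and the toy meshes are illustrative assumptions.

```python
def nearest_indices(cloth_vertices, body_vertices):
    """Index of the closest body vertex for each cloth vertex (brute force)."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return [
        min(range(len(body_vertices)), key=lambda j: sq_dist(c, body_vertices[j]))
        for c in cloth_vertices
    ]


def edges_from_triangles(triangles):
    """Unique undirected edges of a triangle mesh, as sorted index pairs."""
    edges = set()
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            edges.add((min(u, v), max(u, v)))
    return sorted(edges)


# Canonical pose: the index I_C is computed once.
body_rest = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
cloth_rest = [(0.1, 0.2, 0.0), (1.9, 0.2, 0.0), (1.0, 0.3, 0.0)]
index = nearest_indices(cloth_rest, body_rest)

# Target pose: gather the node positions V from the skinned body mesh.
body_posed = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 1.0, 1.0)]
V = [body_posed[j] for j in index]

# Graph edges follow the connectivity of the template cloth mesh.
edges = edges_from_triangles([(0, 1, 2)])
```

Because the index is fixed at the canonical pose, only the gather step has to be repeated for each new target pose.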
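We use the Graph Transformer network [26] to extract features on the constructed mesh graph $M_V$. The architecture of the mesh-based residual stream is illustrated in Fig. 3. In the Graph Transformer layers of our mesh-based residual network, we define $H^{(l)} = \{h^{(l)}_1, h^{(l)}_2, \cdots, h^{(l)}_n\}$ as the node features of the previous layer $l$, where $n$ is the number of nodes. $h^{(l)}_i \in \mathbb{R}^F$ represents the features of node $i$ in layer $l$, whose dimension is $F$. $h^{(l)}_j$ represents the features of node $j$ in layer $l$, where node $j$ is a neighbor of node $i$. Following the formulation of [26], the multi-head attention features $f^{(l)}_{c,ij}$ of head $c$ from node $j$ to node $i$ are computed as follows:
$$q^{(l)}_{c,i} = W^{(l)}_{c,q} h^{(l)}_i + b^{(l)}_{c,q}, \quad k^{(l)}_{c,j} = W^{(l)}_{c,k} h^{(l)}_j + b^{(l)}_{c,k}, \quad e_{c,ij} = W_{c,e} e_{ij} + b_{c,e},$$
$$f^{(l)}_{c,ij} = \left(q^{(l)}_{c,i}\right)^{\top} \left(k^{(l)}_{c,j} + e_{c,ij}\right),$$
where $W^{(l)}_{c,q}$, $W^{(l)}_{c,k}$, $W_{c,e}$, $b^{(l)}_{c,q}$, $b^{(l)}_{c,k}$, and $b_{c,e}$ are trainable parameters. $e_{ij}$ represents the edge features.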
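After normalization, the multi-head attention coefficients $\alpha^{(l)}_{c,ij}$ of head $c$ from node $j$ to node $i$ are computed as:
$$\alpha^{(l)}_{c,ij} = \frac{\exp\left(f^{(l)}_{c,ij} / \sqrt{d}\right)}{\sum_{u \in \mathcal{N}(i)} \exp\left(f^{(l)}_{c,iu} / \sqrt{d}\right)},$$
where $d$ is the hidden size of each head and $\mathcal{N}(i)$ is the set of neighbors of node $i$. The output features $\hat{h}^{(l+1)}_i$ of node $i$ in layer $l+1$ are calculated by the following formula:
$$v^{(l)}_{c,j} = W^{(l)}_{c,v} h^{(l)}_j + b^{(l)}_{c,v}, \quad \hat{h}^{(l+1)}_i = \big\Vert_{c=1}^{C} \Big[ \sum_{j \in \mathcal{N}(i)} \alpha^{(l)}_{c,ij} \left( v^{(l)}_{c,j} + e_{c,ij} \right) \Big],$$
where $C$ is the number of heads, $\Vert$ denotes concatenation, $e_{c,ij}$ is the projected edge feature of head $c$, and $W^{(l)}_{c,v}$ and $b^{(l)}_{c,v}$ are trainable parameters. In order to improve the ability of the feature extraction, a gate $\beta^{(l)}_i$ is calculated as follows:
$$r^{(l)}_i = W^{(l)}_r h^{(l)}_i + b^{(l)}_r, \quad \beta^{(l)}_i = \mathrm{sigmoid}\left( W^{(l)}_g \left[ \hat{h}^{(l+1)}_i ; r^{(l)}_i ; \hat{h}^{(l+1)}_i - r^{(l)}_i \right] \right).$$
Thus, the final output features of node $i$ in layer $l+1$ are updated as:
$$h^{(l+1)}_i = \mathrm{ReLU}\left( \mathrm{LayerNorm}\left( (1 - \beta^{(l)}_i) \hat{h}^{(l+1)}_i + \beta^{(l)}_i r^{(l)}_i \right) \right).$$
As shown in Fig. 3, we use the Graph Transformer network to extract features of the mesh graph. After the feature extraction on the mesh graph, we use a vertex-level MLP and a set of trainable mesh matrices to obtain the wrinkle residual positions. The trainable mesh matrices are represented as $\{M_1, M_2, M_3, \cdots, M_k\}$. $\mathrm{ReLU}(\cdot)$ is used to match the nonlinearity of the high-frequency deformation. $\Delta_M(\gamma)$ is computed from the mesh graph $M_V$ as:
$$\Delta_M(\gamma) = \Psi(M_V),$$
where $\Psi(\cdot)$ represents the mesh-based residual stream. Similar to the skeleton-based residual stream, we use the high-frequency deformation as the ground truth to train the mesh-based residual stream.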
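The normalization of attention scores over the neighbors of a node can be illustrated with a single-head sketch. This is generic scaled dot-product graph attention and not the exact layer of [26]: the learned projections, the edge features, the multi-head concatenation, and the gated residual are all omitted, and the function names and toy vectors are assumptions.

```python
import math


def attention_coefficients(query, keys, d):
    """Softmax over neighbors of the scaled dot products q.k / sqrt(d)."""
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(values, coeffs):
    """Attention-weighted sum of the neighbor value vectors."""
    dim = len(values[0])
    return [sum(c * v[a] for c, v in zip(coeffs, values)) for a in range(dim)]


# Node i with three neighbors; features have hidden size d = 2.
q_i = [1.0, 0.0]
k_neighbors = [[1.0, 0.0], [1.0, 0.0], [0.0, 5.0]]
v_neighbors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha = attention_coefficients(q_i, k_neighbors, d=2)
h_hat = aggregate(v_neighbors, alpha)
```

The first two neighbors have keys aligned with the query and therefore receive equal, larger coefficients than the third neighbor, whose key is orthogonal to the query.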

Skinning Operation
After obtaining the skeleton-based residual component $\Delta_S$ and the mesh-based residual component $\Delta_M$, we compose a new optimized template cloth mesh $T_C(\gamma)$.
To account for the impact of cloth types (tight-fitting or loose-fitting) on the final prediction results, we learn a weight residual $\Delta W_C$ for different cloth types. $\Delta W_C$ is represented as:
$$\Delta W_C = \begin{bmatrix} w_{00} & \cdots & w_{0k} \\ \vdots & \ddots & \vdots \\ w_{n0} & \cdots & w_{nk} \end{bmatrix},$$
where $w_{00}, \cdots, w_{nk}$ are trainable parameters and $k$ is the maximum number of joints. The fused skinning weight matrix $W_C$ is generated by applying the residual $\Delta W_C$ to $W^I_C$, the initial cloth weight obtained from the template body skinning weight $W_B$ through the KD-tree.
In general, the pose embedding function $\Phi(\cdot)$ and $D$ are trained by the skeleton-based residual stream for the coarse deformation, while $\Psi(M_V)$ is trained by the mesh-based residual stream for the wrinkle deformation. $W_C$ is trained for processing different types of cloth.

Loss Function
To optimize the parameters of our network architecture, we use the following loss function to minimize the difference between the predicted deformed cloth mesh and the ground truth:
$$L = \frac{1}{bN} \sum_{\text{batch}} \sum_{j=1}^{N} \left\| x^j_p - x^j_g \right\|_2,$$
where $x^j_p$ is the predicted position of vertex $j$ on the deformed cloth mesh $M_{CP}$, $x^j_g$ is the position of vertex $j$ on the ground truth cloth mesh $M_{CG}$, $N$ is the number of vertices of the cloth mesh $M_{CG}$, $\| \cdot \|_2$ is the $L_2$ distance, and $b$ is the batch size.
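A minimal sketch of such a loss in plain Python is given below. It assumes that the per-vertex $L_2$ distances are averaged over the $N$ vertices and the $b$ meshes of a batch; whether the distance is squared or how it is normalized is not stated in the text, so this averaging is an assumption, as is the function name.

```python
import math


def mean_vertex_l2(predicted, ground_truth):
    """Average Euclidean distance over all vertices of all meshes in a batch."""
    total, count = 0.0, 0
    for mesh_p, mesh_g in zip(predicted, ground_truth):
        for x_p, x_g in zip(mesh_p, mesh_g):
            total += math.dist(x_p, x_g)  # L2 distance between two 3D points
            count += 1
    return total / count


# Batch of b = 2 meshes with N = 2 vertices each (toy values).
predicted = [
    [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)],
    [(1.0, 1.0, 1.0), (0.0, 0.0, 0.0)],
]
ground_truth = [
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
    [(1.0, 1.0, 1.0), (0.0, 0.0, 2.0)],
]
loss = mean_vertex_l2(predicted, ground_truth)
```

In the toy batch, the four per-vertex distances are 0, 5, 0, and 2, so the loss is their mean.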

Dataset and Implementation
In this section, we describe the generation of our dataset and some implementation details.

Dataset
We have generated many different characters and clothing types to validate our network architecture (as shown in Fig. 4). We upload the character meshes of Andy and Qman in canonical poses, such as a T-pose, to the motion capture website Mixamo. We download many character poses computed by that website as FBX files. To remove the dependence on absolute vertex positions and make our network easier to train, we move the hip joint of the character mesh to the origin of the coordinate system. Next, we extract the transformation matrix $\gamma$ of the character at different poses and the skinning weights $W_B$ from the FBX files. After extracting the motion files, we use the skinning operation to obtain the character meshes at different poses with the joint transformation matrix $\gamma$. We use different clothing types such as a T-shirt, dress, and robe. The T-shirt is tight-fitting, and the dress and robe are loose and can result in complex deformations. In order to compute the ground truth of the deformed cloth, we use the physics-based simulator ArcSim [19,18,21] to simulate the cloth. During the simulation, we perform linear interpolation between the adjacent poses and relax the cloth mesh to compute the quasi-static deformation.
To show that our network can process more complex and different characters, we applied our network to non-human characters such as a monster, a dolphin, and a cat. The monster character has a skeleton similar to the human character, while the dolphin and the cat have different skeletons. The dolphin character has no leg joints, while the cat model has four legs without hands. We can also simulate the cloth deformation on these characters. The monster character wears a loose robe, and the dolphin and the cat wear tight-fitting clothes designed for these characters.
The attributes of the skinned character and cloth meshes are shown in Fig. 4. We have highlighted the number of triangles of each character mesh and cloth mesh, the number of joints of the character, and the number of samples used by our algorithm.

Network Implementation and Training
We train our network on a standard PC (Ubuntu 20.04 LTS, Intel i7 CPU at 4.2 GHz, 8 GB RAM, NVIDIA GeForce RTX 3090 GPU). Our network is implemented using PyTorch 1.7.0 and Python 3.8.8.
Following [20] and [33], we also split our dataset for training and testing. For the motion clips obtained from Mixamo, we use 90% of the motion clips as training data and the last 10% as test data, which are unseen during training.
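The clip-level split can be sketched as follows. Splitting by motion clip, and not by individual frame, keeps all frames of a test clip out of the training set. The clip names and the helper `split_clips` are hypothetical.

```python
def split_clips(clips, train_fraction=0.9):
    """Use the first `train_fraction` of the clips for training, the rest for testing."""
    cut = round(len(clips) * train_fraction)
    return clips[:cut], clips[cut:]


# Hypothetical clip identifiers, in the order they were collected.
clips = [f"clip_{i:02d}" for i in range(20)]
train_clips, test_clips = split_clips(clips)
```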
We train our network on the dataset containing different characters and cloth types. As shown in Fig. 4, our dataset has 5 skeleton-based characters (2 human characters and 3 non-human characters) with 7 different types of cloth. During training, we set the learning rate to 1e-3 and use an Adam optimizer [12] to train the parameters of the neural network.

Penetration Handling
It is hard to obtain collision-free predictions or configurations with learning-based methods on the test data, which is unseen during training. We use a method similar to [33] to reduce the penetrations between the cloth and the character. After the prediction, the predicted deformed cloth mesh is optimized by minimizing a penetration error $E_B$ to avoid penetrations between the cloth and the character. Here, $V_{pene}$ is the set of penetrated vertices of the predicted cloth. For each penetrated vertex $v_i$, the closest point $v^B_i$ and normal $n^B_i$ are computed over the character mesh. $E_B$ is the error between the penetrated vertices on the cloth and the character mesh, and $\epsilon$ is a small step to pull the penetrated vertices out of the character mesh. During the optimization process, the positions of $V_{pene}$ are updated, which reduces the number of localized penetrations or collisions.
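The idea behind this post-process can be illustrated with a simplified sketch. Instead of minimizing an error function iteratively, the sketch applies a one-step projection: a cloth vertex that lies behind the closest body point (negative signed distance along the body normal) is moved to a distance $\epsilon$ outside it. The projection step, the brute-force closest-point search, and the function name are assumptions and do not reproduce the optimization of the paper or of [33].

```python
def push_out(cloth, body_points, body_normals, eps=1e-3):
    """Move penetrating cloth vertices to eps outside their closest body point."""
    fixed = []
    for v in cloth:
        # Closest body point by brute force.
        j = min(
            range(len(body_points)),
            key=lambda idx: sum((a - b) ** 2 for a, b in zip(v, body_points[idx])),
        )
        p, n = body_points[j], body_normals[j]
        # Signed distance of v to the tangent plane at p (n is a unit normal).
        signed = sum((a - b) * c for a, b, c in zip(v, p, n))
        if signed < 0.0:  # vertex is inside the body
            v = tuple(a + (eps - signed) * c for a, c in zip(v, n))
        fixed.append(tuple(v))
    return fixed


# One body point at the origin with its normal pointing along +z.
body_points = [(0.0, 0.0, 0.0)]
body_normals = [(0.0, 0.0, 1.0)]
cloth = [(0.2, 0.0, -0.05), (0.0, 0.0, 0.1)]  # first vertex penetrates
fixed = push_out(cloth, body_points, body_normals)
```

Only the penetrating vertex is modified; vertices that are already outside the body keep their predicted positions.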

Results
In this section, we highlight the deformation prediction results of our network on the unseen test data. We compare our predictions on the unseen test data with the ground truth results obtained using a physics-based simulator (ArcSim).

Predicted Deformation using Our Network
Fig. 5 shows the predicted T-shirt deformation at different poses for the character Andy. Our predictions show the fine details with wrinkles, similar to those in the ground truth deformation. We also show the prediction results of other types of cloth and another character, Qman, in Fig. 6. Fig. 6 (a) shows the predicted deformation of the dress on the character Andy, while Fig. 6 (b) and (c) show the cloth deformation on the other character, Qman. The dress on the character Andy in Fig. 6 (a) and the robe on the character Qman in Fig. 6 (c) are both loose-fitting types of clothing. These predictions validate the effectiveness of our network. Since we train the mesh-based residual stream and skinning weight for each clothing type, the deformation details can be easily captured, enhancing the predictions.
Our network can also process other non-human characters with skeletons. The predicted results of our network and the ground truth on the non-human characters are shown in Fig. 7. Fig. 7 (a) shows the result of our network on a non-human character, Monster. The skeleton hierarchy of Monster is similar to that of the human characters in Fig. 6. To show the complex characters that our network can process, we highlight the results of our network on the Dolphin character in Fig. 7 (b) and the Cat character in Fig. 7 (c). The Dolphin has no arm joints or leg joints, while the Cat has four legs without arms. The clothes on the Dolphin and the Cat are designed specifically for these characters. The results of our network show fine predictions of the cloth deformations on these non-human characters. For the non-human characters, the deformation of loose-fitting cloth is also well predicted. Fig. 8 shows the loose-fitting cloth dressed on the character Fox. Our network can predict both the coarse deformed cloth and the finely detailed one.
The results of cloth deformation on the Dolphin, the Cat, and the Fox show the capability of our network to process non-human characters. The predicted deformation captures the fine wrinkle details.
Fig. 9 shows the result of the penetration handling described in Sec. 4.3. After post-processing, the penetration between the back of the character Fox and the dressed cloth is eliminated and a penetration-free result is obtained.

Prediction Runtime
We can perform cloth deformation prediction with our network on both GPUs and CPUs. We have highlighted the runtime of predicting a single cloth mesh in Table 1. The GPU runtime is measured on an NVIDIA GeForce RTX 3090 GPU. The CPU runtime is measured on an Intel i7 CPU. As shown in the table, we can perform a single prediction within 7 ms on a GPU, which is much faster than prior learning-based [15] or physically-based algorithms [13]. The running time of our deformation prediction algorithm on a CPU is less than 0.2 s.

Comparisons
In this section, we qualitatively and quantitatively compare the results of our network with prior learning-based methods. We also perform some ablation experiments to validate the effectiveness of our network.

Comparisons with Prior Learning Methods
Many approaches have been proposed to predict cloth deformations using learning-based networks. We have highlighted many recent methods and their attributes in terms of handling different kinds of characters and clothing types in Table 2. Some methods [20,3,2,6] are based on the SMPL model, which limits them to only processing SMPL human bodies. [23,25] are also based on the SMPL model. However, it is possible to extend them to remove the dependence on the SMPL-based representation. Therefore, we modify these two methods and compare their results with our method in the following sections. [10] uses PCA to extract the principal components of the character vertices and cloth vertices to learn the relationship with the next deformation in the subspace. However, this method uses the previous prediction as the input for subsequent predictions and may result in accumulated errors. [8,7] use DQS [11] to pre-deform the cloth mesh from the canonical pose to the target pose and then use a learning-based network to predict the residual between the pre-deformed cloth mesh and the ground truth. These methods only work well on tight-fitting cloth, and their predictions tend to be smooth and may lose wrinkle details. [33] use an MLP to learn the intrinsic features for cloth vertices and character vertices, which results in a model with many redundant parameters. Furthermore, these methods are mostly limited to one or several specific characters or clothing types. In contrast, our network can overcome these limitations and is more general.

Qualitative Comparisons
We have implemented modified versions of [3] and [2] to process non-SMPL characters. We replace the SMPL skinning method with a skeleton-based character skinning method. The modified version of [2] is an unsupervised method. [3] contains a supervised part and an unsupervised part in its network. We have compared our network with the supervised part of [3]. Fig. 10 shows the comparison between the predictions of PBNS [2], DeePSD [3], and our method. As shown in Fig. 10, the PBNS method [2] tends to predict a deformed cloth mesh that is tightly wrapped around the character, which can introduce artifacts in the deformation. The DeePSD method [3] tends to predict smooth deformations, resulting in penetrations with the character even after postprocessing. This implies that the prediction of DeePSD [3] is driven less by the transformation matrix of the character. In contrast, our method tends to generate deformations with fine wrinkles. We have also implemented the learning algorithm [15] and obtained results similar to those of our method. The predictions of [15] also contain fine wrinkles. However, [15]
Table 2. We compare the characteristics and features of our approach with prior methods. We highlight the unique capabilities of our approach.

Quantitative Comparisons
We also perform quantitative comparisons between our method and previous methods.We use the following error metrics to evaluate the prediction results of our network and others.
In these metrics, $x^i_p$ is the position of vertex $i$ of the predicted mesh $P$, and $x^i_g$ is the ground truth position of vertex $i$. $N$ is the number of vertices of the cloth mesh. $n^i_p$ and $n^i_g$ are the normal vectors of vertex $i$ on the predicted mesh and the ground truth, respectively. The calculated error metrics are shown in Table 3. The results generated by our network are more accurate than those of PBNS [2] and DeePSD [3].
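Two error metrics built from these quantities can be sketched as follows. The exact definitions are assumptions: the position error is taken as the mean Euclidean distance per vertex, and the normal error as the mean angle in degrees between unit vertex normals. The function names and the toy data are illustrative only.

```python
import math


def mean_distance_error(pred, gt):
    """Mean Euclidean distance between predicted and ground truth vertices."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def mean_normal_angle_error(pred_normals, gt_normals):
    """Mean angle in degrees between corresponding unit normals."""
    total = 0.0
    for n_p, n_g in zip(pred_normals, gt_normals):
        dot = sum(a * b for a, b in zip(n_p, n_g))
        dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
        total += math.degrees(math.acos(dot))
    return total / len(gt_normals)


pred = [(0.0, 0.0, 0.0), (0.0, 3.0, 4.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
pred_normals = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
gt_normals = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]

position_error = mean_distance_error(pred, gt)
normal_error = mean_normal_angle_error(pred_normals, gt_normals)
```

A position metric alone can hide missing wrinkles, since a smooth prediction may stay close to the ground truth on average; a normal-based metric is more sensitive to such surface detail.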

Evaluation
We also compare the memory footprint (i.e., the number of parameters used) of different networks in Fig. 11 by measuring the model size. Compared with [15], whose memory footprint is 928.8 MB, the memory footprint of our method is much smaller (36.5 MB). The memory footprint of DeePSD is 3.22 MB, and that of PBNS is 30.4 MB.

Ablation Experiments
To validate the effectiveness of our network architecture, we implement a series of ablation experiments. Fig. 12 shows the results of the modified network without some parts of the overall architecture. The parameter $m$ for the skeleton-based residual stream and the parameter $k$ for the mesh-based residual stream also impact the performance. We have used different values of $m$ and $k$ to train our network. As the value of $m$ increases, the prediction of our network becomes more accurate. However, the memory footprint also increases, which increases the model size of our network. We show the relevant memory footprint of our network on the scene of Qman dressing
The second and third columns are the results of [2] and [3]. The last column is the result of our method. The top and bottom rows are the front and back views of the deformed predictions.

Conclusions, Limitations and Future Work
We present a two-stream skinning-based network to predict cloth deformation from a template cloth in a canonical pose. Our method can process different characters and cloth types while retaining the fine details. Since our network is based on the skinning operation, the memory footprint of our method is low. Our network is also fast at runtime: we can predict a single cloth deformation in about 7ms on a desktop GPU.
Our approach has some limitations. Like prior learning-based methods, our network does not guarantee collision-free predictions. In addition, our method tends to require training a specific model for each character because of the differences between human and non-human characters. As part of future work, we would like to overcome these limitations and extend our work to unsupervised networks [3] or self-supervised networks [24].

Figure 1. Given a skeleton-based representation of a character corresponding to target poses and different types of cloth (loose or tight-fitting), we use a two-stream skinning network to predict the cloth deformation for the target character. (a) and (b) correspond to the same human character with tight and loose-fitting clothing, respectively; (c) is a different human character wearing a long robe. Our network can also handle non-human characters such as a monster (d), a dolphin (e), or even a cat (f).

Figure 2. Our network architecture is composed of the mesh-based residual stream and the skeleton-based residual stream (shown as the green blocks) to obtain the wrinkle residual $\Delta_M(\gamma)$ and the coarse residual $\Delta_S(\gamma)$. $\gamma$ is the transformation matrix of the target pose. The updated cloth template mesh $T_C(\gamma)$ is used by the skinning operation to obtain the final deformed cloth mesh $M_C(\gamma)$.
where $b_{00}, \cdots, b_{02}, \cdots, b_{n0}, \cdots, b_{n2}$ are trainable for the target character and cloth, and $n$ is the number of vertices of the template cloth mesh. Finally, the pose embedding $P$ is fused as the weights to the residual matrix $D$ to obtain the skeleton-based residual component $\Delta_S(\gamma)$.
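The skinning step described in the caption of Fig. 2 can be illustrated with standard linear blend skinning. This is a sketch under assumptions rather than the paper's implementation: it assumes that the coarse and wrinkle residuals are added to the template vertices before skinning, that each bone transformation is a 3x4 affine matrix stored as three rows, and that the skinning weights of each vertex sum to one. The function names are hypothetical.

```python
def apply_affine(transform, point):
    """Apply a 3x4 affine transform (three rows of four values) to a 3D point."""
    x, y, z = point
    return tuple(r[0] * x + r[1] * y + r[2] * z + r[3] for r in transform)

def linear_blend_skinning(template, coarse, wrinkle, weights, transforms):
    """Deform a cloth template with per-vertex residuals and bone transforms.

    template, coarse, wrinkle: lists of 3D tuples, one per cloth vertex.
    weights: per-vertex list of skinning weights, one per bone.
    transforms: list of 3x4 bone transforms for the target pose.
    """
    deformed = []
    for v, d_s, d_m, w in zip(template, coarse, wrinkle, weights):
        # Updated template vertex: template position plus both residuals.
        updated = tuple(a + b + c for a, b, c in zip(v, d_s, d_m))
        out = [0.0, 0.0, 0.0]
        for w_j, t_j in zip(w, transforms):
            p = apply_affine(t_j, updated)
            for axis in range(3):
                out[axis] += w_j * p[axis]
        deformed.append(tuple(out))
    return deformed

# Toy example: one vertex influenced equally by an identity bone and a
# bone translated by 2 units along x.
identity = [(1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0), (0.0, 0.0, 1.0, 0.0)]
shifted = [(1.0, 0.0, 0.0, 2.0), (0.0, 1.0, 0.0, 0.0), (0.0, 0.0, 1.0, 0.0)]
result = linear_blend_skinning(
    template=[(1.0, 0.0, 0.0)],
    coarse=[(0.0, 1.0, 0.0)],
    wrinkle=[(0.0, 0.0, 0.0)],
    weights=[(0.5, 0.5)],
    transforms=[identity, shifted],
)
```

Because the residuals are applied in the canonical pose, a single set of skinning weights carries both the coarse shape and the wrinkles to the target pose.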

Figure 3. The architecture of our mesh-based residual stream. We use a Transformer graph convolutional network to extract features of the reference mesh graph $M_V = (V, E, A)$. The extracted features are passed to vertex-level MLP layers and trainable mesh matrices to obtain the wrinkle residual.
are trainable parameters. $N(i)$ is the set of neighbors of node $i$, and $\|$ is the concatenation operation for $C$-head attention.
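The layer equation is not reproduced here, so the following is only a simplified sketch of multi-head attention over mesh neighbors with concatenation across heads. It assumes scaled dot-product attention with per-head query, key, and value matrices and omits any skip connection or bias; these choices and all names are assumptions for illustration, not the exact layer used in the paper.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Numerically stable softmax over a non-empty list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def multi_head_neighbor_attention(features, neighbors, heads):
    """Aggregate neighbor features with C attention heads and concatenate.

    features: list of node feature vectors.
    neighbors: neighbors[i] is the list of node indices adjacent to node i.
    heads: list of (w_q, w_k, w_v) weight matrices, one triple per head.
    """
    outputs = []
    for i, x_i in enumerate(features):
        concatenated = []
        for w_q, w_k, w_v in heads:
            aggregated = [0.0] * len(w_v)
            if neighbors[i]:
                query = matvec(w_q, x_i)
                keys = [matvec(w_k, features[j]) for j in neighbors[i]]
                values = [matvec(w_v, features[j]) for j in neighbors[i]]
                scale = math.sqrt(len(query))
                alphas = softmax(
                    [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
                )
                for alpha, value in zip(alphas, values):
                    for c, component in enumerate(value):
                        aggregated[c] += alpha * component
            # Concatenate the per-head results (the "||" operation).
            concatenated.extend(aggregated)
        outputs.append(concatenated)
    return outputs

# Toy graph: node 0 is connected to nodes 1 and 2; two heads with identity weights.
eye = [[1.0, 0.0], [0.0, 1.0]]
node_features = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
adjacency = [[1, 2], [0], [0]]
attended = multi_head_neighbor_attention(node_features, adjacency, [(eye, eye, eye)] * 2)
```

With $C$ heads that each produce a vector of dimension $d$, the concatenated output per node has dimension $C \cdot d$.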

Figure 4. The attributes of different characters and clothing types used for our evaluation. We obtain different poses of characters from the Mixamo website. We extract the transformation matrices and skinning weights from the motion files. We use the cloth simulator ARCSim to precompute the deformed cloth meshes for training.

Figure 5. The predicted deformed T-shirt dressed on the character Andy in different poses. All the input poses are unseen during the network training. The top row shows the ground truth of the deformation, while the bottom row shows the predictions of our network. We also highlight the fine details and folds in the zoomed images.

Figure 6. The predicted deformed cloth on other human characters. The first column shows the prediction on the character Andy. The middle and last columns show the deformation predictions on the character Qman.

Figure 7. The results of our network on non-human characters. The first column shows the deformed robe on the Monster, whose skeleton is similar to that of human characters. The middle column shows the deformed cloth on the Dolphin, which has no legs. The last column shows the cloth on the Cat, which has no arms.

Figure 8. The results of our network on the non-human character fox, which is dressed in a loose-fitting cloth. The first row shows the target pose of the character fox. The second row shows the ground truth. The third row shows the coarse prediction. The last row shows the fine detailed prediction.

Figure 9. The results of penetration handling. Left: a penetration between the character fox and the loose-fitting cloth. Right: the penetration-free result.

Fig. 12(a) is the ground truth of the deformed cloth. Fig. 12(b) is the cloth skinning deformation obtained with only the fixed initial skinning weight. With the fixed skinning weight, there are artifacts in the skinning deformation, for example around the legs and belly. Fig. 12(c) is the result with the skeleton-based residual stream and trainable cloth skinning weight. The deformation in Fig. 12(c) tends to capture only the coarse residual. Fig. 12(d) is the result of our full network architecture with the skeleton-based residual stream, the mesh-based residual stream, and trainable cloth skinning weight. Compared with the result in Fig. 12(c), Fig. 12(d) shows that our mesh-based residual stream can capture the fine details of the final deformation. Fig. 12(e) is the result of our network without the trainable cloth skinning weight. Without the trainable skinning weight, the skinning result exhibits more artifacts: there are folds on the legs similar to those in Fig. 12(b).

Figure 10. Comparison of results between our network and previous methods. The first column is the ground truth of the deformed cloth. The second and third columns are the results of [2] and [3]. The last column is the result of our method. The top and bottom rows are the front and back views of the deformed predictions.

Figure 11. Our approach is general in terms of handling all skeleton-based models and meshes, while having a low memory overhead.

We choose m = 32 and k = 128 based on experiments, as we find that further increasing m and k does not noticeably improve the results.

Table 1. The average CPU and GPU runtime for a single cloth mesh prediction.

Table 3. We compare the means and standard deviations of the mesh errors on test samples, based on the ground truth computed by physics-based simulators.

Table 4. Memory footprint with different m and k.