Deep fusible skinning of animation sequences

Moutafidou, Anastasia; Toulatzis, Vasileios; Fudos, Ioannis

doi:10.1007/s00371-023-03130-3

Deep fusible skinning of animation sequences

Original article
Open access
Published: 06 November 2023

Volume 40, pages 5695–5715, (2024)
Cite this article

Download PDF

You have full access to this open access article

The Visual Computer Aims and scope Submit manuscript

Deep fusible skinning of animation sequences

Download PDF

Anastasia Moutafidou¹,
Vasileios Toulatzis¹ &
Ioannis Fudos¹

941 Accesses
1 Citation
Explore all metrics

Abstract

Animation compression is a key process in replicating and streaming animated 3D models. Linear Blend Skinning (LBS) facilitates the compression of an animated sequence while maintaining the capability of real-time streaming by deriving vertex to proxy bone assignments and per frame bone transformations. We introduce a innovative deep learning approach that learns how to assign vertices to proxy bones with persistent labeling. This is accomplished by learning how to correlate vertex trajectories to bones of fully rigged animated 3D models. Our method uses these pretrained networks on dynamic characteristics (vertex trajectories) of an unseen animation sequence (a sequence of meshes without skeleton or rigging information) to derive an LBS scheme that outperforms most previous competent approaches by offering better approximation of the original animation sequence with fewer bones, therefore offering better compression and smaller bandwidth requirements for streaming. This is substantiated by a thorough comparative performance evaluation using several error metrics, and compression/bandwidth measurements. In this paper, we have also introduced a persistent bone labeling scheme that (i) improves the efficiency of our method in terms of lower error values and better visual outcome and (ii) facilitates the fusion of two (or more) LBS schemes by an innovative algorithm that combines two arbitrary LBS schemes. To demonstrate the usefulness and potential of this fusion process, we have combined the outcome of our deep skinning method with that of Rignet—which is a state-of-the-art method that performs rigging on static meshes—with impressive results.

Temporal Parameter-Free Deep Skinning of Animated Meshes

Improving the Perceptual Quality of 2D Animation Interpolation

A perceptual quality metric for dynamic triangle meshes

Article Open access 25 January 2017

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

An animated 3D model is composed of a geometry structure called mesh, a set of bones that comprise its skeleton and their kinematics. In computer graphics, animations play a cardinal role in games and applications of virtual, augmented and mixed reality. However, even nowadays the process of generating a new animation is a cumbersome task that entails experienced animators to place a lot of effort following the traditional way of animating a static 3D model so that we finally have a visually correct result.

To produce an animation, an animator has to conduct rigging of a static mesh (e.g. create and apply a bone structure to vertices), define and apply transformations to bones across a time sequence of poses or frames, alleviate artifacts and emulate nonlinear deformation effects. Some of these steps may be performed fully or semi automatically, due to recent advances in computer vision and tracking techniques. However, the process of creating realistic results for every frame of the animation is complex and may require computationally intensive processing and manual interventions. Therefore, the workflow for producing realistic animation results usually produces a sequence of meshes, also known as animation sequence or animation meshes.

Due to the evolution of cloud-based graphics applications such sequences have to subsequently be converted to allow for streaming and editing. To this end, an efficient compression method constitutes a key factor for efficiently storing, playing and modifying animations. Linear Blend Skinning (LBS) [30] and its variations are efficient compression methods that produce an approximation of the animation sequence that consists of an initial pose and a number of transformations on bones that describe each subsequent pose. Subsequently, surface deformation is determined for each pose since every mesh vertex is influenced by a set of bones.

Despite several limitations that have been addressed in the literature [13], LBS is still used in modern systems since it is simple and straightforward to be developed and computed on GPUs. A comprehensive and self-explanatory way of anticipating its importance is by providing an example. For reproducing an uncompressed animation sequence over the internet of a model with 8k vertices and 48 frames, a bandwidth of $\approx $ 48 Mbps is required with a playing speed of three frames per second (fps) and double precision on arithmetic values for streaming it. The same model with LBS skinning using ten bones with the same playing speed and precision requires a bandwidth of only 1.7 Mbps instead, achieving a compression rate of $\approx $ 96.5%.

There exists a variety of approaches for compression using clustering techniques, most of which are based on geometric features of vertices over time [4, 8, 14, 16,17,18, 25, 27]. An alternative to this are deep architectures that generate a skeleton and its weights by an exemplar-based deep neural network which is given an input mesh and produces a skeleton and a set of skin weights [28, 45].

We introduce a novel machine learning approach that trains a deep neural network by a training set of animation sequences produced by successfully fully rigged animated models. More specifically, the trained network captures vertex trajectories across frames and produces some initial weights that are fed in an optimization scheme to produce a set of weights and transformations that yield results approximating the original animation sequence. Thus, given a new animated mesh sequence, the pretrained network predicts proxy bones and their vertex to bone weight values. This is accomplished by deriving an efficient set of bone to vertex associations based on similarities of the vertex trajectories of the input sequence to trajectories that have been learned successfully by our artificial neural network (ANN) and can therefore be associated with bones based on the rigged animations of the training dataset. Moreover, we additionally improve the efficiency of the least square optimization of transformations and weights that is commonly used to reduce the approximation error by employing conjugate gradient optimization that is suitable for multidimensional systems.

Our method has no limit on the number of vertices, faces and frames given as input. There is just an upper limit on the number of bones for all animations which is set high enough so that a very large domain of 3D animations are supported.

In contrary with previous skinning approaches, our approach is parameter free, in the sense that it does not require providing a predefined number of bones or setting any other tuning parameter.

In our method, geometry preprocessing (translating, scaling or rotating) of our input 3D model is not needed due to the usage of vertex trajectories. As a consequence, only the relative vertex movement is taken into consideration by our deep neural network architecture.

For improving performance and additionally enhancing the capability to capture and associate vertex trajectories to the most suitable bones, we have introduced a persistent labeling scheme for our training set models which is explained in Sect. 3.1. Therefore, bones of the rigged models are labeled using a persistent labeling scheme that is based on the skeleton tree structure and the distance from the root. This novel technique improves performance of deep skinning and facilitates fusion and skeleton awareness in LBS schemes.

Our method derives bones, determines the influence of bones on vertices through weights and computes the value of each bone transformation. Moreover, the mesh can be segmented to areas of bone influence by utilizing the vertex to bone weight information derived by our method. Based on this information we can derive a skeletal rig for an animation sequence. Therefore, the output of our method can be converted to a fully rigged animated character and used in subsequent phases of animation editing and rendering.

Furthermore, we have introduced a process, called fusion, for combining two LBS schemes $\alpha $ and $\beta $ that have been derived with different methods. This works in two ways: (i) we derive an LBS scheme $\beta -\alpha $ with the number of bones of $\alpha $ but with improved bone to vertex associations and weights based on scheme $\beta $ or (ii) we derive an LBS scheme $\alpha -\beta $ with the number of bones of $\beta $ but with improved bone to vertex associations and weights based on scheme $\alpha $. By doing so we can take advantage of the strengths of two or more skinning schemes. To demonstrate the potential of this approach, we have combined our deep skinning method with RigNet [45] with impressive results.

The comparative evaluation of our method is conducted by using mesh sequences that are derived both from rigged animated characters and benchmark mesh sequences from available animation sequence datasets. Both in terms of error measures and bandwidth requirements our method outperforms most previous approaches. In addition, fusing skinning information from (i) Rignet [45] which is used to perform deep skinning based on static mesh morphology and (ii) our vertex trajectory-based deep skinning approach derives two new LBS schemes: one that is more precise than its fused counterparts and one that performs close to the fused counterparts but with much lower bandwidth requirements.

In summary, the paper makes the following technical contributions:

Introduces a parameter free deep skinning approach for producing LBS schemes for animation sequences with low error, high compression and low bandwidth requirements.
The persistent proxy bone labelling facilitates compatibility with skeleton-based animation schemes.
We introduce a fusion approach that combines the benefits of two different LBS schemes.

The rest of the paper is structured as follows: Sect. 2 provides an overview of related work. Section 3 presents a description of our Deep Skinning method. Section 4 offers a thorough comparative experimental evaluation of visual quality, error and compression rate of our method versus previous competent approaches. Finally, Sect. 5 provides conclusions and future research directions.

2 Related work

Despite the extensive research conducted in the field of skinning of animated models, the exploration of deep learning methods for accurate and efficient skinning has been insufficient. One of the most well-known techniques for this procedure is the Linear Blending Skinning (LBS) method which has limitation and in some cases it produces some structural defects. Some artifacts such as the collapsing-elbow and the candy-wrapper effect have been successfully addressed in the literature (see e.g. [42]).

In the field of mesh animation, there are a lot of approaches that can confirm that animating a character can be a tedious task. Processes which have as main goal the conversion of a 3D animated model into an animation sequence or cross-parameterization techniques [22] are proved to demand high memory and space. This renders a set of methods that their hardware requirements are high and in some cases very slow because of this demand. Thus, a need for approaches without a skeleton or skin specified [14] which have the ability to produce compressed animated characters is of the essence. Other methods show that complexity of an animated character is possibly so high that processing of it needs computational high cost and cumbersome animator interference [44, 46]. Kakadiaris and Metaxas [15] on the other hand presents a method where utilizing a sequence of images they try to recover the parameters of the movement through an integrated system.

In this context of high computational and memory cost, animation compression is the key solution. The first work toward this direction exploring LBS usage was made by James and Twigg in SMA [14]. In this method, the authors attempt to approximate articulated characters by introducing pseudo bones that represent vertex clusters formed based on the properties of the vertex transformation matrix. Kavan et al. in SAD [16, 17] set in place an extension of the aforementioned work introducing a dual quaternion skinning scheme which has the ability of approximating highly deformable animations by uniformly mesh points selection acting as bones. Improved skinned approximation result are adopted by both techniques (SMA and SAD) by exploiting EigenSkin [23] which is a PCA-based correction method. Eigenskin enhances skin approximation by removing distortion but with higher space needed. However, both methods provide limited editing of the mesh sequence by either (non-inherited) modification of the bone transformations or by transferring small changes of the rest-pose geometry over the subsequent poses. Kavan et al. in FESAM [18] introduced an iterative method optimizing all of the skinning parameters in a way of supporting all arbitrary animations. Although FESAM offers high-quality reproduction results, it is limited to download-and-play scenarios because it does not use information about topology, and the location of the proxy joints becomes occluded once the optimization process begins.

Pose-to-pose skinning technique introduced by [43] which exploits temporal coherence enabling the full spectrum of applications supported by previous approaches. A novel pose editing of arbitrary animation poses in conjunction with the aforementioned pose-to-pose skinning method, are able to smoothly propagate changes through the subsequent ones generating new deformed mesh sequences. Thus, [43] enables propagation of editing for the rest of the poses while derives similar approximation error with [18].

Kavan and [18] and Le and Deng [25] constitute two other methods that have proved to be efficient in skinning approximation with the first outperforming the second by additional computational cost. Le and Deng [25] approaches a set of example poses by defining a constrained optimization problem for deriving weights and transformations which yields better results in terms of error as compared to [18] in the cost of additional computational cost. Convexity is ensured in our method by a nonnegative least square solver and an additional equation for each vertex. Then we employ linear solvers and update the weights and transformations accordingly.

An improvement of LBS is introduced by [41] presenting a post-process which handles the self-collisions of the limbs by producing skin contact effects and plausible organic bulges in real-time. This procedure is applied over a geometric skinning (such as LBS).

On the other hand, there are techniques that can create a plausible skeleton for a mesh model either by exploiting the movement of vertices to perform mesh segmentation [8], by exploiting the mesh structure by performing constrained Laplacian smoothing [4], or by analyzing the mesh structure of a set of several sparse example poses [11]. Recently, [27] presented a method that first produces a large number of plausible clusters, then reconstructs mesh topology by removing bones and finally performs an iterative optimization for joints, weights and bone transformations. Such methods are competent and produce a fully animated rigged object, but usually are semi-automated since their effectiveness and efficiency depends on setting a large number of parameters that are associated with the structure of the mesh or the specifics of kinematics.

In other methods, a set of vertex weights based on the morphology of a static mesh is predicted [28, 45]. These methods have been trained previously with static meshes and their corresponding weights from animated characters. In addition, [10] matches the mesh against a set of morphable models ending up on a method for rigging a static mesh automatically. On the other hand, our method predicts proxy bones’ weights by training with motion of the vertices frame by frame and the corresponding weights from animated characters. Thus, it is based on the vertex trajectories over time.

In addition, there are approaches that can be used in conjunction with our method to capture correctly highly deformable animations. Bailey et al. [6], Kokkevis et al. [21], Luo et al. [29], and Santesteban et al. [35] are methods that their steps’ format makes them a good fit as our method’s extensions. Bailey et al. [6] utilizing a linear system and an underlying skeleton by employing a deep learning technique to determine the nonlinear part tries to capture nonlinear deformations. Luo et al. [29] captures better nonlinear deformations by including in the animation pipeline a light weight neural network (NNWarp) that is known for its rich expressivity of nonlinear functions. Kokkevis et al. [21] provide a method which in two stages calculates the movement of articulated figures. Similarly, [35] introduces a learning-based method for realistic soft-tissue dynamics modeling as a body shape and motion function.

Moreover, [3] utilizes data from skeletons of different morphology without limitations, which can detect even the most subtle movements, where any skeletal parameterization and motion blending is possible. Motion blending also it is possible to be computed faster and with better visual outcome by using geometry algebra (GA) as opposed to dual quaternion geometry [34].

Our previous work [33] introduces a method that given an animation sequence is capable of clustering vertices using a minimal set of bones, deriving weights based on a pre-trained network with a training set of vertex trajectories pairs over time and their corresponding weights drawn from fully rigged animated characters. In this work have additionally introduced the following:

For improving performance and additionally enhancing the capability to capture and associate vertex trajectories to the most suitable bones, we have adopted a persistent labeling scheme for our training set models. Therefore, bones of the rigged models are labeled using a persistent labeling scheme that is based on the skeleton tree structure and the distance from the root.
The mesh can be segmented to areas of bone influence by utilizing the vertex to bone weight information derived by our method. Based on this information we can derive a skeletal rig for an animation sequence. Therefore, the output of our method can be converted to a fully rigged animated character and used in subsequent phases of animation editing and rendering.
Furthermore, we have introduced a process, called fusion, for combining two LBS schemes $\alpha $ and $\beta $ that have been derived with different methods. This works in two ways: (i) we derive an LBS scheme $\beta -\alpha $ with the number of bones of $\alpha $ but with improved bone to vertex associations and weights based on scheme $\beta $ or (ii) we derive an LBS scheme $\alpha -\beta $ with the number of bones of $\beta $ but with improved bone to vertex associations and weights based on scheme $\alpha $. By doing so we can take advantage of the strengths of two or more skinning schemes. To demonstrate the potential of this approach, we have combined our deep skinning method with RigNet with impressive results.

Techniques coming from computer vision could be used on image space to enhance the outcome (see e.g. [38]). Similarly, research on pose reconstruction and correction [9] can be used to obtain skeleton animations from image sequences that are compatible with our persistent bone scheme.

3 Deep skinning animation

Skinning is based on the core idea that character skin vertices are deformed based on the motion of skeletal bones. One or more weights are assigned to each vertex that represent the percentage of influence vertices receive from each bone. With this approach, we can reproduce an animation sequence based on a reference pose, the vertex weights and a set of transformations for every frame and bone. Based on this popular technique that uses only proxy bones, we have developed an algorithm which is fully automated and can produce a highly compressed skinning approximation of an animation sequence.

Figure 1 illustrates the workflow of our Deep Fusible Skinning method. We build an appropriate neural network model that assigns a set of bones and weights to each vertex by capturing vertex kinematics. Subsequently, we provide as input to our network arbitrary mesh animation sequences and predict their weights. From the per vertex classifier we determine the number of bones and the weights for each vertex (Sect. 3.2). A set of human and animal animations is used to train the neural network model. We achieve this by using as input features of the trajectories of all vertices and as output the weights that represent how each vertex is influenced by a bone. The output weight is conceived by the network as the probability of a bone to influence the corresponding vertex (Sect. 1). We use persistent labeling for bones which yields better results in terms of performance and error (Sect. 3.1). Furthermore, we introduce a fusion process, enabled by persistent bone labeling, that combines the advantages of two or more skinning methods and yields impressive results (Sect. 3.4). For all schemes, we perform optimization to minimize the least square error between the original and approximated mesh frames. We do so by optimizing weights and transformations in an iterative manner (Sect. 3.3). Finally, we offer a set of error metrics and measures to facilitate the assessment of our method and the comparative evaluation with previous competent techniques (see Sect. 4.1).

For a particular LBS scheme $\alpha $, we restrict each vertex to be associated to no more than $MAX\_BONES(\alpha )$. The value of $MAX\_BONES(\alpha )$ is usually four or six, so as to be compatible with the existing animation pipelines [17]. For simple gaming characters, usually four weights per vertex are enough, but six weights per vertex can be used to correct artifacts or capture local deformations with pseudo-bones. In our comparative evaluation, we have used six weights and we have implemented all previous methods with six weights as well, so as to conduct a fair comparison. Our network provides an output for every plausible bone. Bones are numbered from $0 \dots B_{max}-1$, where $B_{max}$ is the maximum number of bones that a character may contain and in our system this is set to 120. In our deep skinning scheme $\alpha $, the derived $MAX\_BONES(\alpha )$ weights per vertex correspond to the highest probability predictions of the network. We have experimented with two types of weights: binary weights where each weight is 0 or 1 and scalar weights where we use the scalar value of the probability prediction for each weight. We then normalize these weights to sum to 1 (coefficients of a convex combination). Since probability prediction of a vertex toward a specific bone cluster represents similarity to a training example, this is naturally translated to bone influence on vertices. Figure 2 illustrates the difference between binary and scalar weights through an example.

Table 1 Table with bone label distribution for human models

Full size table

Table 2 Table with bone label distribution for animal models

Full size table

3.1 Persistent bone labeling

We have developed a method for consistent bone labeling for the models of the training set. We do this by partitioning the bones into groups that correspond to parts of the body of the articulated character. Within each group ordering and labeling is performed according to the distance from the parent bone (i.e. bones closer to the parent bones are assigned smaller bone numbers). Numbering is carried out manually for a small subset of the training set and then a neural network is used for the rest of the models of the training set to predict the numbering of their bones. This is achieved by feeding as input to an LSTM net with one attention layer, trained by the manually numbered models, the trajectory of the centroid of each bone of the new model.

We have labeled bones consistently for several human and animal models (see Tables 1 and 2).

An example is shown in Fig. 3 of an animal and a human with bone labeling.

Persistent bone labeling facilitates fusion of LBS schemes (see Sect. 3.4 since the proxy bones assigned to the weights have a structural identification and offer the capability to detect neighboring proxy bone in the same or in different LBS schemes. We have experimentally determined that persistent bone labeling improves accuracy. Intuitively, if we do not have persistent bone labeling and bone indices are randomly assigned for each animation of the training set then the following situation may occur. Assume that we have two animations A and B that are quite similar as far as the movement of the forearm is concerned. Then when trained with animation A the trajectories of the vertices of the forearm of A are assigned to a random bone say bone 2, and when trained with animation B the trajectories of the vertices of the forearm of B are assigned to a random bone, say bone 50. When used on an animation sequence C that is not in the training data set the trajectories of half the vertices of the forearm of C may be more similar to that of animation A and will therefore be assigned to bone 2, whereas the trajectories of the remaining vertices of the forearm of C may be more similar to that of animation B and will therefore be assigned to bone 50. Moreover, the trajectories of the vertices of the head of C may well be assigned to bone 2, since they may be similar to the head movement of animation B that was assigned to bone 2. Given this situation, it is difficult to find transformations for bone 2 of animation C, to optimize both the vertices of the forearm and the head.

3.2 A neural network architecture for weight prediction

Our method adopts a supervised learning approach to leverage the power of neural networks on multi-label classification. Consequently, we utilize a neural network instead of using clustering techniques (unsupervised learning) to obtain better initial weights and bones for skinning. Thus, a classifier is utilized which is trained to classify vertex trajectories and match them to bone labels by computing the probability of each vertex to be classified to (affected by) a proxy bone. As a next step, we keep only the $MAX\_BONES(\alpha )$ proxy bones with higher probabilities and we normalize to [0, 1] (the sum of all weights for a vertex is 1). From that point on these are the weights and bones of our method. We have experimented with a variety of neural network models that can efficiently be trained to detect vertex motion patterns and mesh geometry characteristics and use similarities among them for clustering vertices into bones and determining weights implicitly by predicting the influence of bones on the mesh surface. We have chosen networks that perform well in sequence learning. To this end, we have selected and trained several network models with the training data set described in Sect. 1. The training process exhibits high accuracy and very low loss in all three different networks. More specifically, the neural networks tested are both recurrent and feed-forward. In all network cases, sequences of vertex coordinates for all animation frames are given as input.

The first network that we propose is the long short-term memory (LSTM) network which is a improved type of Recurrent Neural Network (RNN) that consists of a cell, an input gate, an output gate and a forget gate. The cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell. These networks are created by applying the same set of weights recursively over a differentiable graph-like structure by traversing it in topological order. This makes them suitable for classifying and predicting sequence data.

The type of RNN network used is a Long Short-term Memory network firstly introduced by [12] (LSTM), which consists of units made up of a cell remembering time inconstant data values, a corresponding forget cell, an input and an output gate being responsible of controlling the flow of data in and out of the remembering component of it Figure 4. The actual difference of an LSTM compared to RNN is that LSTM has the capability of memorizing long-term dependencies regarding time data without resulting in emerging gradient vanishing problems (gradient loss exponentially decay). Not only does this capability make LSTM networks suitable for animation frame learning, but it is also a powerful way of predicting highly accurate weights.

Thus, utilization of many network units for LSTM construction (120 units used) produces a network that is able of predicting weights even for models with a large number of bones. Regarding the activation functions we used (i) an alternative for the activation function (cell and hidden state) by using sigmoid instead of tanh and (ii) the default for the recurrent activation function (for input, forget and output gate) which is sigmoid. The main reason of using the sigmoid function instead of the hyperbolic tangent is that our training procedure involves the network deciding per vertex whether it belongs or not to the influence range of a bone. This results in higher efficacy and additionally makes our model learn more effectively.

The second network that we have used successfully is a feed-forward network called Convolutional Neural Network (CNN) [19] that uses convolutional operations to capture patterns for determining classes mainly in image classification problems. CNNs are additionally able to be used in classification of sequence data with quite impressive results. On top of the two convolutional layers utilized, we have also introduced a global max-pooling layer (down-sampling layer) and a simple dense layer so that we have the desirable number of weights for each proxy bone. In the two convolutional layers (Conv1D) used, we utilize 8 filters of kernel size 2. The number of filters and kernel size have been determined experimentally. A small kernel size works better because it is able to capture the transitions from one frame to the next with high accuracy since there is only a small vertex movement between two consecutive frames. The last network that we have considered for completeness is a hybrid neural network that is a combination of the two aforementioned networks with some modifications.

All networks take as input an arbitrarily large sequence that represents the trajectory of a vertex, i.e. the (x, y, z) position at each frame, and predict the bone weights for this vertex. The first issue that we faced was to select the appropriate network architecture for our problem. Therefore, we had to determine two factors: (i) the batch size that yields the better trade-off between effectiveness and efficiency and (ii) the network that provides the best generalization as the training set increases (i.e. has the best behavior for predicting weights of random test sequences).

Firstly, we have conducted experiments with the entire training dataset for all networks for different batch sizes (1024, 2048, 4096, 8192). We have determined that the best trade-off between effectiveness and efficiency is provided by batch size 4096 for all networks. Secondly, we have conducted a set of experiments for determining the best choice of network, usage of bone labeling and binary vs scalar weights. We used several training sets on all different network architectures (LSTM, CNN, Hybrid).

More specifically, we created four different training sets (all subsets of the training set described in Sect. 1) with 9, 13, 17 and 24 models, which had both animals and humans inside. The 9 model set with contains 5 people and 4 animals. The 13 model set contains 7 people and 6 animals. The 17 model contains 9 people and 8 animals. Finally, the 24 model set contains 12 people and 12 animals.

Then, we defined four categories which are characterized by the use or not of bone labeling and by the choice of scalar vs binary weights.

3.3 Transformation and weight optimization

Approximating an animation sequence to produce a more succinct representation is common in the case of articulated models and is carried out through a process called skinning. Skinning approximates the motion of vertices based on the kinematics of the bones that influence them. This means that bone-vertex relations need to be established, meaning, which bone affects which vertices and what is the amount of bone influence which is expressed by the corresponding skin to bone weights.

For every vertex $v_i$ that is influenced by a bone $j$, a weight $w_{ij}$ is assigned. For skeletal rigs, the skeleton and skin of a mesh model are given in a predetermined pose also known as bind or rest pose. The rigging procedure blends the skeleton with skin which is given by the rest pose of the model. Each transformation is the concatenation of a bind matrix that takes the vertex into the local space of a given bone and a transformation matrix that moves the bone to a new position.

In LBS, the new position of vertex ${v'}\,_i^{p}$ at pose (frame) p is given by Eq. 1. This approach corresponds to using proxy bones instead of the traditional hierarchical bone structure on rigid or even on deformable models [17].

$$\begin{aligned} {v'}\,_{i}^{p} = \sum _{j=1}^{B} w_{ij}\cdot T_{j}^{p} \cdot v_i \end{aligned}$$

(1)

In this equation, $v_i$ represents the position of the vertex in rest pose, $w_{ij}$ the weight by which bone j influences vertex $v_i$ and $T_{j}^{p}$ is the transformation that is applied to bone j during frame p. Intuitively, proxy bones act as attractors and the weights represent the intensity of their attraction. Each vertex can be attracted by different bones and each bone attracts each vertex in its area of influence with different intensity. This setting triggers several caveats when distributing proxy bones and assigning weight-influences.

Computing a good initial set of weights is a key step for the final result. In deep skinning, a neural network provides the proxy bones and initial weights that are appropriate for an animation sequence. After that, we perform a successive optimization to find weights and proxy bone transformations. Both problems are formulated as least squares optimization problems that minimize the quantity given in Eq. 2.

$$\begin{aligned} \sum _{i=1}^{N} {\Vert {v'}\,_{i}^{p} - v_{i}^{p}\Vert }^2 \end{aligned}$$

(2)

where $v_{i}^{p}$ denotes the coordinates of the original vertex in pose p, ${v'}\,_{i}^{p}$ is the approximation based on deep skinning and N is the number of vertices in the model. For the following, the number of vertices is N, the number of frames is P, the number of proxy bones is B and $M=MAX\_BONES(\alpha )$ for our LBS scheme $\alpha $. To solve the weight optimization problem, we formulate the system $A{\textbf {x}}=b$, where matrix A is a $3PN \times MN$ matrix constructed by combining the rest-pose vertex positions and the corresponding transformations, ${\textbf {x}}$ is a MN vector of unknowns that contains the weights and b is a known 3PN vector that consists of the original (target) vertex coordinates in all frames. However, since each vertex has its own weights we solve N systems $3P \times M$ with M unknowns each, where the right part is a 3P vector. In the case of finding the optimal weights, we include the convexity coefficient requirement as an extra equation per vertex (so A is a $(3P+1) \times M$ matrix and b is a $3P+1$ vector).

Finally, to solve the transformation optimization problem, we formulate a linear system that consists of 3N equations, the unknowns of which are the (3x4) elements of the transformation matrices $T^p_j$ of each bone j and frame p. This sums to 12BP unknowns. The system can be expressed as $A {\textbf {x}} = b$, where A is a $3NP \times 12BP$ known matrix constructed by combining the rest-pose vertex positions and the corresponding vertex weights. Moreover b is a known 3NP vector that contains the original (target) vertex coordinates. Since the transformations for each pose are different we can solve P linear systems $A {\textbf {x}} = b$, where A is a $3N \times 12B$ matrix, x is a vector with 12B unknowns and b is a 3N vector with the target coordinates for each pose.

To avoid reverting into nonlinear solvers, we alternatively optimize weights and transformations separately. In terms of optimization, Vasilakis et al. [43] uses NNLS (nonnegative least square) optimization for enforcing the convexity condition of the weights. We express the convexity by a separate equation per vertex which is closer to the approach adopted by [18]. Both [43] and [18] use an iterative process where weights and transformations are optimized separately (coordinate descent). Vasilakis et al. [43] suggests that 5 iterations are enough, whereas [18] employs 15 iterations. We have performed experiments for up to 50 iterations and our conclusion is that after 10 to 15 iterations there is no significant error improvement. To perform the optimization problem, we have employed conjugate gradient optimization which works better on multidimensional variable spaces and can be carried out efficiently on the GPU.

3.4 Fusion of LBS schemes

We introduce a novel approach for fusing the bones and weights of two LBS schemes. We present a method that modifies the vertex weights of one LBS scheme (scheme $\alpha $) without affecting the original number of bones by utilizing information provided by a second LBS scheme (scheme $\beta $). By doing so, we create a new LBS scheme, which is denoted as scheme $\beta $-$\alpha $ that has the same number of bones as scheme $\alpha $ and a distribution of weights determined by combining information from both schemes $\alpha $ and $\beta $.

Note that $\beta -\alpha $ scheme maintains the number of bones and the bandwidth requirements of $\alpha $ scheme and $\alpha -\beta $ the number of bones of $\beta $. Thus, the selection of the fused scheme depends on either which method ($\alpha $ or $\beta $) is more accurate and/or which one has the smallest number of bones and therefore the lowest bandwidth requirements for streaming. For example, consider two schemes $\alpha $ and $\beta $ such that the $\alpha $ scheme has less bones and the $\beta $ scheme has better accuracy with more bones. If someone requires better accuracy for higher bandwidth should pick the fused scheme $\alpha -\beta $, and if someone requires lower bandwidth should use the fused scheme $\beta -\alpha $.

After the selection of the fusion, the final scheme undergoes a fitting process for finding the optimal set of weights and transformation. This optimizes significantly LBS scheme $\alpha $ by using information from LBS scheme $\beta $ in two ways:

alters the initial positioning of weights of scheme $\alpha $ by using information from scheme $\beta $ and therefore helps discover better local minima.
adds additional vertex-bone connections to scheme $\alpha $ up to the maximum number of bones for scheme $\alpha $.

We introduce the following notation for representing an LBS scheme $\alpha $. An LBS scheme is represented by a bipartite graph $\alpha = (V, B_\alpha , W_\alpha )$ where V is the set of vertices, B is the set of bones and $W_\alpha $ is the set of edges that associate a vertex with a bone, i.e. $W_\alpha $ contains pairs (v, b) where $v \in V$, $b \in B_\alpha $. Each weight edge has a label that is given by a function $w_\alpha (v,b)$ that represents the weight value that connects v and b.

We denote an animation scheme as a pair $(\alpha , F, T)$ where F is a set of frames and T is a function that associates every pair (b, f) where $b \in B_\alpha $ and $f \in F$ to the corresponding transformation T(b, f), i.e. the transformation of bone b at frame f.

For every animation scheme $\alpha = (V, B_\alpha , W_\alpha )$, the following hold:

Let $B_\alpha (v)$ be the set of bones associated with a vertex v, then the cardinality of this set is always bounded by a parameter that is fixed for this LBS scheme, the number of weights per vertex $MAX\_BONES(\alpha )$: $|B_\alpha (v)| \le MAX\_BONES(\alpha )$.

For a vertex $v \in V$ the sum of all weights is equal to 1: $\sum _{b:\,(v,b) \in W_\alpha }{w(v,b)}=1$ and $w(v,b) > 0$.

Algorithm 1 describes the process of deriving a new LBS scheme $\beta -\alpha $ by altering the weights and bone-weight associations of scheme $\alpha $ by using information form scheme $\beta $.

We have established experimentally that when the two schemes contain different information for vertex to bone association, the new scheme combines this information and therefore has significantly lower error.

To this end, we have tested our algorithm by using LBS scheme Rignet in combination to our DS scheme. The Rignet LBS scheme is derived from the output of the method introduced by Zhou et al. in [45] that derives a skeleton from a static mesh and a set of weights. From this set of weights and bones, we derive an LBS scheme by simply computing the optimal LBS transformation.

Therefore, we have combined an LBS scheme that has been derived by a static mesh analysis (Rignet) and our LBS scheme that is derived by clustering vertices based on their trajectories (DS) to derive two new schemes namely the DS-Rignet scheme and the Rignet-DS scheme. The DS-Rignet scheme has the same set of bones as Rignet with an improved set of weights based on the temporal mesh analysis derived by DS. The Rignet-DS scheme has the same set of bones as DS with improved weights based on the static mesh analysis performed by Rignet. The results are reported in Sect. 4.2.

4 Experimental evaluation of deep fusible skinning

One of the main contributions of our work is that it expresses a combinatorial optimization problem with constraints as a classification problem and then proposes a method to solve it using deep learning techniques.

The entire method was developed using Python and Tensorflow [1] under the Blender 2.79b scripting API. The training part runs on an Intel Core i7-4930K 3.4GHz system with 48Gb RAM under Windows 10 64-bit operating System, with an NVIDIA GeForce RTX 2080Ti GPU with 11GB GDDR6 RAM. We trained our network models with Adam Optimizer [20], ${\text {learning Rate}}=0.001$ for $20-100$ epochs with batch size 4096 over a training dataset that incorporates 67 animated character models of different size in terms of number of vertices, animations and frames per animation. We have inferred that 20 epochs are usually enough for convergence in terms of the error metrics and most importantly toward an acceptable visual outcome. However to obtain better ERMS and distortion errors without over-fitting 100 epochs is a safe choice independently of the training set size. Furthermore, with this choice of batch size, we overcome the over-fitting problem that was apparent by observing the Max Average Distance metric and was manifested by locally distorted meshes.^{Footnote 1}

Table 3 The two smallest and largest examples in terms of vertices from the human and animal training sets

Full size table

The network model is initially fit on a training dataset, that is a set of examples used to adjust the parameters of the model. The network is trained on the training set using a supervised learning method. Table 3 provides information for the two largest and smallest models in terms of vertices among the entire set that have been used for training.

The training dataset consists of input vector pairs that represent the motion of each vertex through all frames and the corresponding output vector of labels which determines whether a vertex is influenced by a specific bone. The input vector size is $(3 \cdot F)$, where 3 represents the x, y, z coordinates of a vertex and F the number of frames for the specific animation and the output consists of $B_{max}$ labels, where $B_{max}$ is the maximum number of bones. The current network model is fed with the training dataset and produces a result, which is then compared with the label vector, for each input vector in the training set. Based on the result of the comparison and the specific learning algorithm being used, the parameters of the model are adjusted (supervised learning).

We have used two types of training datasets, one that consists of human character animations and one that comprises animal character animations. The animal dataset contains 32 animated animal characters with an average number of 12k vertices each, an average number of 3 animations per character and an average number of 195 frames per animation. The human dataset contains 35 animated human characters with an average number of 10k vertices each, 1 animation per character, and an average number of 158 frames per animation.

During training, the fitted network model is used to predict the response of observations in a second smaller dataset called the validation set. This set consists of 20% randomly chosen example pairs of the training dataset and is used to tune the hyper-parameters of the network. The training time is an important characteristic, especially if we wish to expand our training dataset in the future. Figure 5 indicates the time that is needed for training using LSTM network (Sect. 3.2) showing how the training time increases based on changing the number of models (of average vertices, animations and per animation number of frames as described above), number of vertices and frames for additionally presenting a more detailed picture of how training time is affected by these two factors. After a complete training session, the trained network model is used in our deep skinning method to predict bones and weights for a given animation sequence.

The rest of our algorithm (prediction and optimization) was tested on the same system without the use of the GPU. In addition, the FESAM algorithm was developed and tested on the same system.

Evaluation of our method is estimated by conducting a thorough experimental study to substantiate the effectiveness of our method based on the resulting error and performance metrics (quantitative results) along with visual outcome based metrics (qualitative results).

In the first category, we employed two well-known error metrics: Distortion Percentage DisPer and Root-Mean-Square Error ERMS. Moreover, we have used two performance-related metrics: Bandwidth needed for streaming an LBS scheme and Compression Rate of the original animation sequence when being represented by an LBS scheme.

Furthermore, we present qualitative results demonstrating the visual quality of our results with two metrics that we have introduced in this paper: Maximum Average Distance MaxAvgDist which represents the maximum average vertex misplacement across frames and Normal Distortion which estimates the average distance of the original face normals from the approximated ones. All the metrics are represented thoroughly in refsubsec:errors.

4.1 Error, compression and bandwidth metrics and measures

We have used three different types of measures to calculate error of the approximation methods. The first two measures are standard measures used in [14, 17, 18]. The first error measure is the percentage of deformation known as distortion percentage (DisPer).

$$\begin{aligned} DisPer = 100 \cdot \frac{\Vert A_{orig} - A_{Approx}\Vert _F}{\Vert A_{orig} - A_{avg}\Vert _F}. \end{aligned}$$

(3)

where $\Vert \cdot \Vert _F$ is the Frobenius matrix metric. In Eq. 3$A_{orig}$ is a 3NP matrix which consists of the real vertex coordinates in all frames of the model. Similarly, $A_{Approx}$ has all the approximated vertex coordinates and matrix $A_{avg}$ contains in each column, the average of the original coordinates in all frames. Kavan [18] replaces 100 by 1000 and divides by the surrounding sphere diameter. Sometimes this measure tends to be sensitive to the translation of the entire character, therefore we use a different measure that is invariant to translation. The root-mean-square (ERMS) error measure in Eq. 4 is an alternative way to express distortion with the difference that we use $\sqrt{3NP}$ in the denominator so as to obtain the average deformation per vertex and frame during the sequence. $3NP$ is the total number of elements in the $A_{orig}$ matrix. [27] uses as denominator the diameter of the bounding box multiplied by $\sqrt{NP}$.

$$\begin{aligned} {\text {ERMS}}= 100 \cdot \frac{\Vert A_{orig} - A_{\text {Approx}}\Vert _F}{\sqrt{3NP}} \end{aligned}$$

(4)

We introduce a novel error measure, namely the max average distance (MaxAvgDist) given by Eq. 5) which is a novel quality metric that reflects better the visual quality of the result. Max distance denotes the largest vertex error in every frame. So this measure represents the average of max distances over all frames.

$$\begin{aligned} MaxAvgDist = \frac{1}{P}\sum _{f=1}^{P}\max _{i=1,...,N}{\Vert v_{orig}^{f,i} - v_{\text {Approx}}^{f,i}\Vert } \end{aligned}$$

(5)

Finally, we introduce an additional measure that characterizes the normal distortion - (NormDistort) and is used to measure the different behavior of two animation sequences during rendering. We compute the average difference between the original and the approximated face normals by the norm of their cross product that equals to the sine of the angle between the two normal vectors. Therefore for a model with F faces and P frames, where $NV^{i,j}$ is the normal vector of face j at frame i, Eq. 6 computes the normal distortion measure.

$$\begin{aligned} \begin{aligned}&NormDistort = \\ {}&sin^{-1}\bigg (\frac{1}{FP} \sum _{i=1}^{P}\sum _{j=1}^{F}{||NV^{i,j}_{orig} \times NV^{i,j}_{Approx}||}\bigg ) \end{aligned} \end{aligned}$$

(6)

Additionally, we present two measures that characterize the space efficiency of our approach.

The compression rate express the percentage reduction in space requirements for representing an animation sequence with an LBS scheme as opposed to store the entire mesh sequence. For the computations, we use double precision arithmetic.

$$\begin{aligned} {\text {CompressionRate}} = 100 \, \frac{{\text {Original}} - {\text {LBS}}}{\text {Original}} \end{aligned}$$

(7)

Compression rate is the percentage of less bytes needed by the original ones that we achieve by utilizing only the approximated transformations and weights (proxy bones in Deep Skinning) instead of utilizing the whole 3D animation. This metric is calculated as in 7 equation which has been derived from the difference of $Original = 24 N P$ bytes needed for a full 3D animation sequence minus the space requirements for the LBS scheme which is the sum of 24N (vertices), 96BP (transformations) and 9MN (weights), $LBS=24N + 96BP + 9NM$ divided by Original bytes and multiplied with 100, so that it is converted to percentage. Recall that $M=MAX\_BONES(\alpha )$ is the number of bones per vertex for our LBS scheme $\alpha $.

The bandwidth is the required network bandwidth so that we can stream an animation.

For an LBS scheme with a frame duration $t_F$, we need per frame to stream all B transformations. Each transformation has $12*64 = 768$ bits for double precision arithmetic.

$$\begin{aligned} {\text {Bandwidth}}({\text {LBS}}) = \frac{768\,B}{t_F} \, bps \end{aligned}$$

(8)

For a full animation sequence, the required bandwidth is computed by the space required for all vertices with double precision arithmetic which is $3*64\, N$ bits.

$$\begin{aligned} {\text {Bandwidth}}({\text {FULL}}) = \frac{192\,N}{t_F} \, bps \end{aligned}$$

(9)

In our experiments, we have used a typical $t_F = \frac{1}{24} sec$

4.2 Quantitative results

Our training dataset incorporates several human and animal models with a variety of animations, number of vertices and frames. In our case, we created a single dataset with animals and humans (see Sect. 1). As far as the test set is concerned, the details are reported in Sect. 1.

The optimization part of our algorithms is explained in Sect. 3.3.

Figures 6 and 7 provide a comparative evaluation of our method with the most competent skinning approach FESAM. The original FESAM algorithm follows three steps of optimization with the first step being the process of optimizing the initial pose something that is not compatible with the traditional animation pipelines. For that reason. this step is not included in our experiments and subsequently we use the FESAM-WT approach with two steps (weight and transformation optimization). More specifically, we evaluate the performance of our approach with distortion and RMS errors for the four cases of bone labeling and binary vs scalar weights as compared to FESAM-WT. Based on the results, the bone labeling with scalar weights variation of our method performs overall better among the three variation on both animal and humans characters.

Table 4 Comparative evaluation of our method versus Method I [14], Method II [17], Method III [18], Method IV [36], Method V [24]

Full size table

Table 4 is a comparison of our method with other similar methods producing LBS schemes with pseudo-bones when presented with several benchmark animation sequences from literature. More specifically, Table 4 presents a comparison of our method with four benchmark animation sequences, that were not produced by fully animated rigs, with all previous combinations of LBS, quaternion-based and SVD methods. N is the number of vertices, F is the number of frames and the number in round brackets is the result of the method combined with SVD. Our method derives better results in terms of both error and compression rate as compared to methods I-V. Method [2] is only cited for reference since it yields impressive results in terms of error by increasing significantly the parameter space of the problem but is not compatible with any of the standard animation pipelines.

Moreover, Fig. 8 shows the speed up that we have achieved in the fitting time by using the conjugate gradient method which is more efficient in multi-dimensional problems such as the ones that we are solving (12BP variables for transformations, and MN variables for weights)

For computing the error, we have used a small test dataset with representative human characters and a small testing dataset with representative animal models (all subsets of the test dataset described in Sect. 1). Tables 5 and 6 show the results for the smallest and the largest of training dataset for all networks and for batch size 4096. The experiments show that as the dataset increases, the network that behaves better is the LSTM network with bone labeling and scalar weights. This network is capable of handling effectively both human and animal characters.

Table 5 Comparison between bone labeling—scalar/binary weights for animal models

Full size table

Table 6 Comparison between bone labeling - scalar/binary weights for human models

Full size table

Finally, we have conducted experiments for studying the performance of the outcome of fusing two LBS schemes (see Sect. 3.4). To this end, we have fused Deep Skinning method (DS) and Rignet and we have determined the performance of the fused LBS schemes: Rignet-DS and DS-Rignet. These schemes have yielded lower errors outperforming both Rignet and DS. This is illustrated in Fig. 9 where the bandwidth requirements (see Sect. 4.1) are also visualized to give us a better understanding of the overall efficiency. Recall that Rignet and DS-Rignet have the same number of bones (the bones produced by Rignet). Likewise, DS and Rignet-DS have the same number of bones (the bones produced by DS). Therefore, Rignet and DS-Rignet are worse in terms of bandwidth consumption because Rignet uses in average four times more bones as compared to DS. However, DS-Rignet exhibits less error than Rignet, and Rignet-DS exhibits less error than DS. Therefore, fusing improves the two fused LBS schemes by utilizing information from both schemes.

4.3 Visual quality evaluation results

In computer graphics, qualitative results (visual and otherwise) are an important means of assessing a novel method. In this section, we present three processes for assessing the visual quality of deep skinning. Finally, we assess the visual quality of the fused schemes.

4.3.1 Quality measure evaluation

We use the MaxAvgDist quality assessment measure. This is a measure that indicates how far in terms of visual quality the generated frames are from the original frame sequence (see Sect. 4.1). Low measure values correspond to high quality animation.

Figure 10 suggests that bone labeling with scalar weights yields results with better quality measure as compared to FESAM-WT. Figures 10a and 10b confirm the quantitative results.

4.3.2 Visualization-based evaluation

As an additional assessment criterion for our method, we provide an illustration of the visual outcome. By using the term visual outcome, we refer to the approximated output frames as compared to the frames of original 3D model. After conducting several experiments, we have observed that our approach seems to approximate better the original model. In every case, there is a noticeable difference between deep skinning and FESAM-WT. To this end, a demonstration video is also provided as supplementary material.

Figures 11 and 12 illustrates the differences of the two approximation methods as compared to the original model animation. Several frames have been selected with noticeable structural flaws.

Error visualization techniques can provide an insight for the parts where errors occur. We use the turbo colormap [31] to represent the error per vertex which is color blind friendly. This error is the distance in a particular frame of the approximated vertex from the original one. Figure 13 illustrates the per vertex error in a particular frame for deep skinning and FESAM_WT. For that reason, we have developed a visualization tool so that we can visualize the error measures.

4.3.3 Lighting quality evaluation

One step closer to evaluating visual outcome is additionally how our 3D animated model reacts with lighting and not as a 3D mesh only. Therefore, we offer the results of evaluating the average distortion of normal vectors. The normal distortion measure (see section 4.1) shows how close the normal vectors of the approximated sequence are to the normal vectors of the original animation sequence. This determines how the approximated character will behave in an lighting model as compared to the original animated character. The results of Fig. 14 exhibit an average error of 0.01 radians for human characters and an average error of 0.05 radians for animal characters.

4.3.4 Quality comparison of fused schemes

The key value of the fusing process that it fuses two information sources (LBS schemes) for a more accurate visual outcome. To this end, we have fused the outcome of Rignet with the outcome of our DS method. Rignet provides information of bones that have been generated from the static mesh. Deep Skinning detects bones by clustering vertices using their trajectories. Rignet usually produces more bones as compared to what DS generates as proxy bones. DS contains useful vertex clustering information deduced from the vertex trajectories.

The fused scheme DS-Rignet maintains the large number of bones from Rignet and leverages the dynamic information from DS. The Rignet-DS scheme uses the bone structure of DS and the vertex clustering suggested by the properties of the static mesh. The visual quality of the four schemes is illustrated in Fig. 15.

4.4 Discussion and applications

We have presented a method called Deep Skinning that feeds an animation sequence with no underlying skeleton or rigging/skinning information to a pre-trained neural network to generate an approximated compressed skinning model with pseudo-bones. This type of animation representation can be used subsequently for a variety of applications.

An animation sequence consists of mesh models for every frame. To edit this mesh sequence, an experienced animator is required to put a lot of effort. On the other hand, editing a skeleton-based rigged animation can be accomplished with a multitude of available open and proprietary tools with minimum effort.

Table 7 Comparison between Temporal Deep Skinning and four methods. Specifically Method A [27], Method B [37], Method C, Method D [11]

Full size table

To this end, we have developed a post-processing tool that using the compressed skinning model with pseudo-bones and per frame transformations obtained by deep skinning produces the corresponding hierarchical skeleton, skinning data and transformations. More specifically, using the mesh clustering derived by our method, the pseudo-bones and transformations we produce a fully animated character model. This is accomplished by the following steps: (i) perform weight regularization and derive disjoint vertex clusters that are influenced by each bone, (ii) based on the neighboring clusters export the joints of the entire model (the structure of the skeleton) [26] and (iii) finally perform joint location adjustment by geometric constraints and a simple recalculation of orientation and rotation for each of the joints that yields their final position [5].

Figure 16 illustrates the original and approximated representation of a 3D model. This animated model consisting of 14, 007 vertices and 4, 669 faces was approximated by the deep skinning algorithm with 24 bone clusters and up to six weights per vertex.

Moreover, Fig. 17 presents the computed bones and weights for an animation sequence. This animation sequence consists of 48 different frames from the horse-gallop sequence. After the Deep Skinning algorithm, we were able to extract 19 bone clusters and up to six weights per vertex. Subsequently, we have produced a fully animated character.

Finally, Table 7 provides a comparison of our method with four methods that produce actual skeletal rigs. In this case of 7, we cite the results from the papers since such methods are difficult to reproduce and this goes beyond the scope of this paper. For two models (horse gallop and samba), we have measured the ERMS error and the compression rate percentage (CRP). Note that the results of [27] were converted to our ERMS metric by multiplying by $\frac{D}{\sqrt{3}}$, where D is the diagonal of the bounding box of the rest pose.

For the horse gallop model, our method approximates the sequence by using 26 bones and achieves a smaller ERMS error as compared to all previous methods. For the samba model, our method uses 17 bones and outperforms all previous methods.

5 Conclusions and future work

In a nutshell, in this work, we propose a classification technique that derives pseudo-bones and their corresponding weights for an animated sequence by matching vertex trajectories to proxy bones and computing the probability of a vertex to be affected by a proxy bone. To this end, we train a deep neural network with a set of rigged animated characters whose bones have been manually labeled. Moreover, we introduce a new fusing algorithm that is capable of combining different LBS schemes followed by the fitting process for extracting bone to vertex weights and transformations approximating the original 3D animation sequence.

Conducting a comparative evaluation, we conclude that our method with consistent labeling accomplishes lower approximation errors from its previous methods. To demonstrate the usefulness of our fusing algorithm, we have fused the outcome of Rignet with the outcome of our Deep Skinning method with impressive results.

Furthermore, efficiency of weight and transformation fitting has improved by using the conjugate gradient optimization method. Finally, with the bone labeling technique, we have further improved the overall approximation error.

Our future work aims on utilizing this setting for animation synthesis of either totally new animations or expanding/extending the existing ones. This is possible by leveraging the versatility of our fusing algorithm.

Data availibility

The data that support the findings of this study are openly available in https://github.com/AnastasiaMoutafidou/FusibleSkinning.

Notes

https://github.com/AnastasiaMoutafidou/FusibleSkinning.

References

Abadi, M. et al: Tensorflow: a system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283 (2016). https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf
Alexa, M., Müller, W.: Representing animations by principal components. Comput. Graph. Forum (2000). https://doi.org/10.1111/1467-8659.00433
Article Google Scholar
Andreou, N., Aristidou, A., Chrysanthou, Y.: Pose representations for deep skeletal animation. Computer Graphics Forum. 41, 155–167 (2022). https://doi.org/10.1111/cgf.14632
Article Google Scholar
Au, O.K.C., Tai, C.L., Chu, H.K., Cohen-Or, D., Lee, T.Y.: Skeleton extraction by mesh contraction. ACM Trans. Graph. 27(3), 44:1-44:10 (2008)
Article Google Scholar
Avril, Q., Ghafourzadeh, D., Ramachandran, S., Fallahdoust, S., Ribet, S., Dionne, O., de Lasa, M., Paquette, E.: Animation setup transfer for 3d characters. In: Proceedings of the 37th Annual Conference of the European Association for Computer Graphics, EG ’16, pp. 115–126. Eurographics Association, Goslar, DEU (2016)
Bailey, S.W., Otte, D., Dilorenzo, P., O’Brien, J.F.: Fast and deep deformation approximations. ACM Trans. Graph. (2018). https://doi.org/10.1145/3197517.3201300
Article Google Scholar
CGTrader: 3d models for vr/ar and cg projects (2022). https://www.cgtrader.com/
De Aguiar, E., Theobalt, C., Thrun, S., Seidel, H.P.: Automatic conversion of mesh animations into skeleton-based animations. Comput. Graph. Forum 27(2), 389–397 (2008). https://doi.org/10.1111/j.1467-8659.2008.01136.x
Article Google Scholar
Fan, C., Fu, W., Liu, S.: A high-precision correction method in non-rigid 3d motion poses reconstruction. Connect. Sci. 34(1), 2845–2859 (2022). https://doi.org/10.1080/09540091.2022.2151569
Article Google Scholar
Feng, A., Casas, D., Shapiro, A.: Avatar reshaping and automatic rigging using a deformable model. In: Proceedings of the 8th ACM SIGGRAPH Conference on Motion in Games, MIG ’15, pp. 57–64. Association for Computing Machinery, New York, NY, USA (2015). https://doi.org/10.1145/2822013.2822017
Hasler, N., Thormählen, T., Rosenhahn, B., Seidel, H.P.: Learning skeletons for shape and pose. In: Proceedings of the 2010 ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, I3D ’10, pp. 23–30. Association for Computing Machinery, New York, NY, USA (2010). https://doi.org/10.1145/1730804.1730809
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997). https://doi.org/10.1162/neco.1997.9.8.1735
Article Google Scholar
Jacobson, A., Deng, Z., Kavan, L., Lewis, J.P.: Skinning: Real-time shape deformation. In: ACM SIGGRAPH 2014 Courses, SIGGRAPH ’14. Association for Computing Machinery, New York, NY, USA (2014). https://doi.org/10.1145/2614028.2615427
James, D.L., Twigg, C.D.: Skinning mesh animations. In: ACM SIGGRAPH 2005 Papers, SIGGRAPH ’05, pp. 399–407. Association for Computing Machinery, New York, NY, USA (2005). https://doi.org/10.1145/1186822.1073206
Kakadiaris, I., Metaxas, D.: Vision-based animation of digital humans. In: Proceedings Computer Animation ’98 (Cat. No.98EX169), pp. 144–152 (1998). https://doi.org/10.1109/CA.1998.681919
Kavan, L., Collins, S., Žára, J., O’Sullivan, C.: Skinning with dual quaternions. In: Proceedings of the 2007 Symposium on Interactive 3D Graphics and Games, I3D ’07, pp. 39–46. Association for Computing Machinery, New York, NY, USA (2007). https://doi.org/10.1145/1230100.1230107
Kavan, L., McDonnell, R., Dobbyn, S., Žára, J., O’Sullivan, C.: Skinning arbitrary deformations. In: Proceedings of the 2007 Symposium on Interactive 3D Graphics and Games, I3D ’07, pp. 53–60. Association for Computing Machinery, New York, NY, USA (2007). https://doi.org/10.1145/1230100.1230109
Kavan, L., Sloan, P.P., O’Sullivan, C.: Fast and efficient skinning of animated meshes. Computer Graphics Forum 29, 327–336 (2010). https://doi.org/10.1111/j.1467-8659.2009.01602.x
Article Google Scholar
Khan, A., Sohail, A., Zahoora, U., Qureshi, A.S.: A survey of the recent architectures of deep convolutional neural networks. Artif. Intell. Rev. (2020). https://doi.org/10.1007/s10462-020-09825-6
Article Google Scholar
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations (ICLR), May 7–9, 2015, San Diego (2015)
Kokkevis, E., Metaxas, D., Badler, N.: User-controlled physics-based animation for articulated figures. In: Proceedings Computer Animation ’96, pp. 16–26 (1996). https://doi.org/10.1109/CA.1996.540484
Kraevoy, V., Sheffer, A.: Cross-parameterization and compatible remeshing of 3d models. In: ACM SIGGRAPH 2004 Papers. SIGGRAPH ’04, pp. 861–869. ACM, New York, NY, USA (2004)
Kry, P.G., James, D.L., Pai, D.K.: Eigenskin: Real time large deformation character skinning in hardware. In: Proceedings of the 2002 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA ’02, pP. 153–159. Association for Computing Machinery, New York, NY, USA (2002). https://doi.org/10.1145/545261.545286
Le, B., Deng, Z.: Smooth skinning decomposition with rigid bones. ACM Trans. Graph. 31(6) (2012). In press
Le, B.H., Deng, Z.: Smooth skinning decomposition with rigid bones. ACM Trans. Graph. 31(6), 199:1-199:10 (2012)
Article Google Scholar
Le, B.H., Deng, Z.: Smooth skinning decomposition with rigid bones. ACM Trans. Graph. (2012). https://doi.org/10.1145/2366145.2366218
Article Google Scholar
Le, B.H., Deng, Z.: Robust and accurate skeletal rigging from mesh sequences. ACM Trans. Graph. (2014). https://doi.org/10.1145/2601097.2601161
Article Google Scholar
Liu, L., Zheng, Y., Tang, D., Yuan, Y., Fan, C., Zhou, K.: Neuroskinning: automatic skin binding for production characters with deep graph networks. ACM Trans. Graph. (2019). https://doi.org/10.1145/3306346.3322969
Article Google Scholar
Luo, R., Shao, T., Wang, H., Xu, W., Chen, X., Zhou, K., Yang, Y.: Nnwarp: neural network-based nonlinear deformation. IEEE Trans. Vis. Comput. Graph. 26(4), 1745–1759 (2020)
Google Scholar
Magnenat-Thalmann, N., Laperrière, R., Thalmann, D.: Joint-dependent local deformations for hand animation and object grasping. In: Proceedings on Graphics Interface ’88, pp. 26–33. Canadian Information Processing Society, CAN (1989)
Mikhailov, A.: Turbo, An Improved Rainbow Colormap for Visualization, Google AI Blog, August 20, 2019 (2019)
Mixamo: empowering creativity with animated 3d characters (2022). https://www.mixamo.com
Moutafidou, A., Toulatzis, V., Fudos, I.: Temporal parameter-free deep skinning of animated meshes. In: Advances in Computer Graphics: 38th Computer Graphics International Conference, CGI 2021, Virtual Event, September 6–10, 2021, Proceedings, pp. 3–24. Springer-Verlag, Berlin, Heidelberg (2021). https://doi.org/10.1007/978-3-030-89029-2_1
Papagiannakis, G.: Geometric algebra rotors for skinned character animation blending. In: SIGGRAPH Asia 2013 Technical Briefs, SA ’13. Association for Computing Machinery, New York, NY, USA (2013). https://doi.org/10.1145/2542355.2542369
Santesteban, I., Garces, E., Otaduy, M.A., Casas, D.: Softsmpl: data-driven modeling of nonlinear soft-tissue dynamics for parametric humans. Comput. Graph. Forum 39(2), 65–75 (2020). https://doi.org/10.1111/cgf.13912
Article Google Scholar
Sattler, M., Sarlette, R., Klein, R.: Simple and efficient compression of animation sequences. In: Proceedings of the 2005 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA ’05, pp. 209–217. Association for Computing Machinery, New York, NY, USA (2005). https://doi.org/10.1145/1073368.1073398
Schaefer, S., Yuksel, C.: Example-based skeleton extraction. In: Proceedings of the Fifth Eurographics Symposium on Geometry Processing, SGP ’07, pp. 153–162. Eurographics Association, Goslar, DEU (2007)
Shen, Ww., Chen, L., Liu, S., Zhang, Y.D.: An image enhancement algorithm of video surveillance scene based on deep learning. IET Image Process. 16(3), 681–690 (2022). https://doi.org/10.1049/ipr2.12286
Article Google Scholar
Sketchfab: publish and find 3d models online (2022). https://sketchfab.com
TurboSquid: 3d models for professionals (2022). https://www.turbosquid.com/
Vaillant, R., Barthe, L., Guennebaud, G., Cani, M.P., Rohmer, D., Wyvill, B., Gourmel, O., Paulin, M.: Implicit skinning: real-time skin deformation with contact modeling. ACM Trans. Graph. 32(4), 125:1-125:12 (2013)
Article Google Scholar
Vasilakis, A., Fudos, I.: GPU rigid skinning based on a refined skeletonization method. J. Vis. Comput. Anim. 22(1), 27–46 (2011). https://doi.org/10.1002/cav.382
Article Google Scholar
Vasilakis, A.A., Fudos, I., Antonopoulos, G.: PPS: pose-to-pose skinning of animated meshes. In: Proceedings of the 33rd Computer Graphics International, CGI ’16, pp. 53–56. ACM, New York, NY, USA (2016). https://doi.org/10.1145/2949035.2949049
Wareham, R., Lasenby, J.: Bone glow: An improved method for the assignment of weights for mesh deformation. In: Proceedings of the 5th International Conference on Articulated Motion and Deformable Objects, AMDO ’08, pp. 63–71. Springer-Verlag, Berlin, Heidelberg (2008)
Xu, Z., Zhou, Y., Kalogerakis, E., Landreth, C., Singh, K.: Rignet: neural rigging for articulated characters. ACM Trans. Graph. 39(4), 14 (2020). https://doi.org/10.1145/3386569.3392379
Zell, E., Botsch, M.: Elastiface: Matching and blending textured faces. In: Proceedings of the Symposium on Non-Photorealistic Animation and Rendering, NPAR ’13, pp. 15–24. ACM, New York, NY, USA (2013)

Download references

Acknowledgements

This research is co-financed by Greece and the European Union (European Social Fund- ESF) through the Operational Programme "Human Resources Development, Education and Lifelong Learning" in the context of the project "Strengthening Human Resources Research Potential via Doctorate Research" (MIS-5000432), implemented by the State Scholarships Foundation (IKY).

Funding

Open access funding provided by HEAL-Link Greece. The publication of the article in OA mode was financially supported by HEAL-Link.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of Ioannina, Ioannina, Greece
Anastasia Moutafidou, Vasileios Toulatzis & Ioannis Fudos

Authors

Anastasia Moutafidou
View author publications
You can also search for this author in PubMed Google Scholar
Vasileios Toulatzis
View author publications
You can also search for this author in PubMed Google Scholar
Ioannis Fudos
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Ioannis Fudos.

Ethics declarations

Conflict of interest

All authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (mp4 37814 KB)

A Appendix

In machine learning, a common task is the study and construction of algorithms, that can learn from and make predictions on data. Such algorithms work by making data-driven predictions or decisions, through building a mathematical model from input data. The data that are used to build the final model usually come from multiple datasets.

Table 8 Examples of four animation models from the test set

Full size table

For the test dataset used in several experiments throughout this paper, we have selected a set of human and animal models that are not included in the training datasets. The test dataset comprises 10 human and 10 animal models to ensure generalization and accuracy. For example, the testing dataset includes the animations shown in Figure 18, whose characteristics are reported in Table 8. Regarding the animated 3D models that we used for our training dataset or test set are mainly from Sketchfab [39], CGTrader [7], Turbo-squid [40] and Mixamo [32] Note that all dataset models are extracted from FBX animations which means that are fully animated with skeletal rigs, skinning information and transformations. However, we use only the generated animation sequence for training. The skeletal information is only used for comparison with the outcome (proxy bone structure) of our method.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Moutafidou, A., Toulatzis, V. & Fudos, I. Deep fusible skinning of animation sequences. Vis Comput 40, 5695–5715 (2024). https://doi.org/10.1007/s00371-023-03130-3

Download citation

Accepted: 03 September 2023
Published: 06 November 2023
Issue Date: August 2024
DOI: https://doi.org/10.1007/s00371-023-03130-3

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Deep fusible skinning of animation sequences

Abstract

Similar content being viewed by others

Temporal Parameter-Free Deep Skinning of Animated Meshes

Improving the Perceptual Quality of 2D Animation Interpolation

A perceptual quality metric for dynamic triangle meshes

1 Introduction

2 Related work

3 Deep skinning animation

3.1 Persistent bone labeling

3.2 A neural network architecture for weight prediction

3.3 Transformation and weight optimization

3.4 Fusion of LBS schemes

4 Experimental evaluation of deep fusible skinning

4.1 Error, compression and bandwidth metrics and measures

4.2 Quantitative results

4.3 Visual quality evaluation results

4.3.1 Quality measure evaluation

4.3.2 Visualization-based evaluation

4.3.3 Lighting quality evaluation

4.3.4 Quality comparison of fused schemes

4.4 Discussion and applications

5 Conclusions and future work

Data availibility

Notes

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Supplementary Information

A Appendix

A Appendix

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation