A Majorization-Minimization Based Method for Nonconvex Inverse Rig Problems in Facial Animation: Algorithm Derivation

Automated methods for facial animation are a necessary tool in the modern industry since standard blendshape head models consist of hundreds of controllers, and a manual approach is painfully slow. Different solutions have been proposed that produce output in real-time or generalize well over different face topologies. However, all these prior works consider a linear approximation of the blendshape function and hence do not provide a sufficiently high level of detail for modern realistic human face reconstruction. We build a method for solving the inverse rig in blendshape animation using quadratic corrective terms, which increase accuracy. At the same time, due to the proposed construction of the objective function, it yields a sparser estimated weight vector compared to the state-of-the-art methods. The former feature means a lower demand for subsequent manual corrections of the solution, while the latter indicates that such manual modifications are also easier to include. Our algorithm is iterative and employs the Majorization-Minimization paradigm to cope with the increased complexity produced by adding the corrective terms. The surrogate function is easy to solve and allows for further parallelization on the component level within each iteration. This paper is complementary to an accompanying paper, Racković et al. (2023), where we provide detailed experimental results and discussion, including highly realistic animation data, and show that the results are clearly superior to those of the state-of-the-art methods.


Introduction
Human face animation is an increasingly popular field of research within the applied mathematics community, of interest not only for the production of movies and video games but also in virtual reality, education, communication, etc. A common approach to face animation is using blendshapes Lewis et al. [2014], Çetinaslan [2016]. Blendshapes are a vector representation of the face, where each vector b_1, ..., b_m ∈ R^{3n} stands for a single atomic deformation of the face, and a resting face is represented by b_0 ∈ R^{3n}. More complex expressions are obtained by combining the blendshape vectors, commonly using a linear delta blendshape function

f_L(w) = b_0 + Σ_{i=1}^{m} w_i ∆b_i,    (1)

where ∆b_i = b_i − b_0 for i = 1, ..., m, and w_i ∈ [0, 1] are activation weights corresponding to each blendshape. These models provide intuitive controls, even though building the base shapes is demanding in terms of time and effort Neumann et al. [2013], Mengjiao et al. [2020], Li et al. [2013], Zhang et al. [2020]. Inverse rig estimation is a common problem that consists of choosing the right activation weights w_1, ..., w_m in (1) to produce a predefined expression b ∈ R^{3n}. It is one of the aspects of blendshape animation that can be automated, hence it is often addressed in the literature, usually posed in a least-squares fashion as

minimize_{w_1, ..., w_m} ‖f_L(w) − b‖²,    (2)

with possibly additional constraints or regularization. Here, and throughout the paper, ‖·‖ stands for the l_2 norm.
Possible approaches for solving the inverse rig can be classified into data-based and model-based techniques: the first group demands large amounts of animation data for training purposes, while the second works with only the blendshape function and the basis vectors. While various machine learning techniques show excellent performance Holden et al. [2016], Bailey et al. [2020], Song et al. [2020], Seonghyeon et al. [2021], Deng et al. [2006], Song et al. [2011], Seol and Lewis [2014], Holden et al. [2015], Feng et al. [2008], Yu and Liu [2014], Bouaziz et al. [2013], such methods demand long animated sequences that cover all the regular facial expressions in order to train a good regressor. This often makes them infeasible or unprofitable. We propose a new model-based method for solving the inverse rig problem that includes the quadratic corrective terms, which leads to a higher accuracy of the fit compared to the standard linear rig approximation Holden et al. [2016], Song et al. [2017]. Our method utilizes the common framework of Majorization-Minimization Zhang et al. [2007], Lange et al. [2000].

Contributions
In the companion paper Racković et al. [2023], we present a novel method for solving the inverse rig problem when the blendshape model is assumed to be quadratic. This method targets a specific subdomain of facial animation: highly realistic human face models used in movie and video game production. Here, an accurate fit plays a more critical role than the execution time. For this reason, the added complexity of a quadratic blendshape rig is justified, since it significantly increases the mesh fidelity of the result. Besides increasing the mesh accuracy, our solution yields fewer activated components than the state-of-the-art methods, which is another desirable property in production. The main contributions of the current paper are a detailed derivation of the proposed inverse rig method and a description of its convergence guarantees. We refer to Racković et al. [2023] for further practical and implementation aspects, as well as extensive numerical results on real-world animation data.
The rest of the paper is organized as follows. Section 2 formulates the inverse rig problem and covers the existing solutions from the literature. Section 3 introduces our algorithm and gives a detailed derivation of each step. Section 4 discusses numerical results. Finally, we conclude the paper in Section 5.

Problem formulation and preliminaries
The main components of the blendshape model are the neutral mesh b_0 ∈ R^{3n}, sculpted by an artist, and a set of m blendshapes b_1, ..., b_m ∈ R^{3n}, where n is the number of vertices in the mesh. Blendshapes are topologically identical copies of the neutral mesh, but with some vertices displaced in space to simulate a local deformation of the face. The offsets between the neutral mesh and the blendshapes yield delta blendshapes ∆b_i = b_i − b_0, for i = 1, ..., m, that are added on top of the neutral mesh b_0, with a weight w_i ∈ [0, 1], to produce an effect of local deformation as b_0 + w_i ∆b_i.
Multiple local deformers are usually combined to produce more complex facial expressions. A blendshape function can then be defined as

f_L(w) = b_0 + Bw,    (3)

where w = [w_1, ..., w_m]ᵀ is a weight vector and B = [∆b_1, ..., ∆b_m] ∈ R^{3n×m} is a blendshape matrix. The notation f_L(·) indicates that this is a linear model; for realistic face representation, it is common to consider a more complex form with quadratic terms, as explained in the following.
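As a quick illustration, the linear rig amounts to a single matrix-vector product. The following is a toy sketch with random data standing in for a sculpted model; the names (`f_lin`, `b0`, `B`) are ours, not from a production rig.

```python
import numpy as np

# Toy sketch of the linear delta blendshape function f_L(w) = b0 + B w.
# n = 2 vertices, m = 3 blendshapes; all data is a random stand-in.
n, m = 2, 3
rng = np.random.default_rng(0)
b0 = rng.standard_normal(3 * n)        # neutral mesh, stacked (x, y, z) coords
B = rng.standard_normal((3 * n, m))    # delta blendshapes as columns

def f_lin(w):
    """Linear rig: neutral face plus a weighted sum of delta blendshapes."""
    return b0 + B @ w

mesh = f_lin(np.array([0.5, 0.0, 1.0]))
```

With all weights at zero, the rig reproduces the neutral face exactly.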
Some pairs of blendshapes b_i and b_j with an overlapping region of influence might produce artifacts on the face (breaking the mesh or giving an unbelievable deformation); hence, a corrective term b_{{i,j}} needs to be included to fix these issues and make the character's appearance natural. An artist discovers these combinations, and the corrective terms are conventionally sculpted by hand. Whenever the two blendshapes are activated simultaneously, the corrective term is added as well, with a weight equal to the product of the two corresponding weights. We define a set P that stores tuples of indices (i, j) such that a pair of blendshapes i and j has a corresponding corrective term b_{{i,j}}. A quadratic blendshape function is then defined as

f_Q(w) = b_0 + Bw + Σ_{(i,j)∈P} w_i w_j ∆b_{{i,j}},    (4)

where ∆b_{{i,j}} denotes the corrective term in delta representation. In production, it might be essential to solve the inverse problem. That is, given a mesh b, conventionally obtained either as a 3D scan of an actor or a capture of face markers, one needs to estimate a configuration of the weight vector w that produces a mesh as similar as possible to b. This problem is known as the inverse rig problem or automatic keyframe animation. As we will discuss in the next section, it is common to pose it in a least-squares setting.
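A minimal sketch of the quadratic rig of Eq. (4) follows; the pair set `P` and the corrective deltas `db_corr` are illustrative stand-ins, not data from a real character.

```python
import numpy as np

# Toy sketch of the quadratic blendshape function of Eq. (4):
#   f_Q(w) = b0 + B w + sum over (i, j) in P of w_i * w_j * db_corr[(i, j)].
n, m = 2, 4
rng = np.random.default_rng(1)
b0 = rng.standard_normal(3 * n)
B = rng.standard_normal((3 * n, m))
P = [(0, 1), (2, 3)]                                 # pairs with corrective terms
db_corr = {p: rng.standard_normal(3 * n) for p in P}

def f_quad(w):
    """Quadratic rig: linear part plus pairwise corrective terms."""
    mesh = b0 + B @ w
    for (i, j) in P:
        mesh = mesh + w[i] * w[j] * db_corr[(i, j)]
    return mesh
```

Note that a corrective term contributes only when both blendshapes of its pair are active, exactly as described above.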

Existing solutions
When we consider a model-based solution to the inverse rig problem, the state-of-the-art method is Cetinaslan and Orvalho [2020], where the optimization problem is formulated as a regularized least-squares minimization:

minimize_w ‖Bw − b‖² + α‖w‖²,    (5)

where b is understood as the offset of the target mesh from the neutral face. Regularization is necessary because many blendshape pairs are highly correlated, i.e., produce relatively similar deformations over the mesh; hence, the unregularized problem is often ill-posed and the solution is not unique. Additionally, it is desirable to keep the number of non-zero elements of w low, because this allows for further manual editing, which is common in animation. The solution to (5) is given in closed form as

w = (BᵀB + αI)⁻¹ Bᵀ b.    (6)
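The closed-form estimate (6) is a one-line ridge solve. The sketch below uses synthetic data and, as an assumption of this toy setup, treats `b` as the target offset from the neutral face.

```python
import numpy as np

# Closed-form regularized least-squares estimate of Eq. (6):
#   w = (B^T B + alpha I)^{-1} B^T b.
rng = np.random.default_rng(2)
n, m = 50, 10
B = rng.standard_normal((3 * n, m))
w_true = rng.uniform(0.0, 1.0, m)
b = B @ w_true                         # noiseless target offset
alpha = 1e-3

w_hat = np.linalg.solve(B.T @ B + alpha * np.eye(m), B.T @ b)
```

On noiseless data the ridge bias is tiny, so `w_hat` recovers the generating weights up to a small perturbation controlled by `alpha`.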
A modification to this solution is given in the same paper. The authors approximate the blendshape matrix B with a sparse matrix B_loc, obtained by applying a heat kernel over the rows of the original blendshape matrix. This sets low values to zero, meaning that the effects of the least significant blendshapes are excluded for each vertex.
A different approach is given in Seol et al. [2011], where the components are visited and optimized sequentially, and after each iteration i = 1, ..., m, a residual term is updated:

(step 1) w_i = argmin_{w_i} ‖b^(i) − w_i ∆b_i‖²,
(step 2) b^(i+1) = b^(i) − w_i ∆b_i,    (7)

where the residual is initialized as b^(1) = b − b_0. Step 1 finds the optimal activation of a single controller w_i, and step 2 removes its effect for the subsequent iterations. This yields a sparse weight vector and excludes the possibility of simultaneously activating mutually exclusive blendshapes (like mouth-corner-up and mouth-corner-down). However, the order in which the blendshapes are visited is crucial for obtaining an acceptable solution. The authors propose ordering them based on the average offset that each blendshape causes over the whole face.

Proposed method

In this section, we derive our algorithm; the companion paper Racković et al. [2023] presents extensive experiments on the method. Our algorithm specifically targets high-quality realistic human face animation, hence we assume that real-time execution is not a priority and that the activation weights are strictly bounded to the [0, 1] interval. We first explain the derivation of the algorithm and then detail each of its steps. The optimization problem looks for the weight vector w that best fits a given mesh b, assuming the quadratic blendshape function (4):

minimize_w ‖f_Q(w) − b‖² + α 1ᵀw  subject to 0 ≤ w ≤ 1,    (8)

where the first term is the data fidelity and the second is a regularization term. The non-negativity constraint is important in blendshape animation since negative weights have no semantic meaning and make it harder for animators to adjust the obtained results manually. While weights larger than one might be useful for exaggerated cartoonish expressions, in realistic human avatars this is not favorable behavior. The regularization term with α > 0 enforces a low cardinality of the solution vector, which is a desired feature as it makes the results more artist-friendly Seol et al.
[2011]. The problem is approached in a fashion similar to the Levenberg-Marquardt method Ananth [2004]: we choose an initial vector of controller weights w^(0) ∈ R^m and, at each iteration t = 0, ..., T, solve for an optimal increment vector v ∈ R^m via the following optimization problem:

minimize_v ‖f_Q(w^(t) + v) − b‖² + α 1ᵀ(w^(t) + v)  subject to 0 ≤ w^(t) + v ≤ 1.    (9)

The weight vector is updated as w^(t+1) = w^(t) + v, and the process is repeated until convergence. Under the quadratic approximation of the rig function, the objective in (9) is fairly complex, hence we simplify it by applying Majorization-Minimization (MM). That is, to solve (9), we define an upper bound function ψ(v; w) : R^m → R over the objective in (9) such that it is easier to minimize and satisfies Conditions 1 and 2 given below. Note that these conditions define a class of functions ψ(v; w) as potential majorizers of the objective in (9), but later in this section we define the specific choice of ψ(·) used in the proposed algorithm.
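For concreteness, the objective of problem (8) can be evaluated directly. The sketch below uses a toy quadratic rig mirroring Eq. (4); all names and data are illustrative.

```python
import numpy as np

# Direct evaluation of the objective in (8): quadratic-rig data fidelity plus
# the regularizer alpha * 1^T w (an l1-type penalty, since w >= 0).
rng = np.random.default_rng(3)
n, m = 5, 4
b0 = rng.standard_normal(3 * n)
B = rng.standard_normal((3 * n, m))
P = [(0, 1)]
db_corr = {(0, 1): rng.standard_normal(3 * n)}
alpha = 0.1

def f_quad(w):
    mesh = b0 + B @ w
    for (i, j) in P:
        mesh = mesh + w[i] * w[j] * db_corr[(i, j)]
    return mesh

def objective(w, b):
    """||f_Q(w) - b||^2 + alpha * 1^T w, for 0 <= w <= 1."""
    return float(np.sum((f_quad(w) - b) ** 2) + alpha * np.sum(w))

w_true = rng.uniform(0.0, 1.0, m)
b = f_quad(w_true)   # at w_true the fidelity term vanishes, leaving the penalty
```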
Condition 1. For any feasible vector 0 ≤ w ≤ 1 and all values of the increment vector v such that 0 ≤ w + v ≤ 1, the following holds:

‖f_Q(w + v) − b‖² + α 1ᵀ(w + v) ≤ ψ(v; w).    (10)

Condition 2. The upper bound ψ(v; w) matches the value of the objective (9) at v = 0:

ψ(0; w) = ‖f_Q(w) − b‖² + α 1ᵀw.    (11)

In the rest of the paper, we write ψ(v) instead of ψ(v; w) for the sake of simplicity. We then proceed with the problem in the form

minimize_{v : 0 ≤ w + v ≤ 1} ψ(v),    (12)

and the following proposition guarantees that such an approach leads to the minimization of the original objective.
Proposition 1. Under Conditions 1 and 2, the sequence of iterates w^(t), t ∈ N, obtained as solutions to problem (12) (with w^(t+1) = w^(t) + v^(t)) produces a monotonically non-increasing sequence of values of the objective (8), i.e.,

‖f_Q(w^(t+1)) − b‖² + α 1ᵀw^(t+1) ≤ ‖f_Q(w^(t)) − b‖² + α 1ᵀw^(t).    (13)

Proof. If v^(t) is a minimizer of (12) at iteration t, then ψ(0) ≥ ψ(v^(t)). From this and from Conditions 2 and 1, we have the relation

‖f_Q(w^(t)) − b‖² + α 1ᵀw^(t) = ψ(0) ≥ ψ(v^(t)) ≥ ‖f_Q(w^(t) + v^(t)) − b‖² + α 1ᵀ(w^(t) + v^(t)),    (14)

which proves the proposition.
Additionally, from Wu [1983], we have that, under the above iterative method, the objective function converges to a local optimum or a saddle point as the number of iterations t goes to infinity.
Upper Bound Function. Let us now introduce the specific majorizer that we apply in the proposed algorithm; a complete derivation is given later in this section. We define the upper bound function of the objective (9) as

ψ(v) = Σ_{i=1}^{3n} ψ_i(v) + α 1ᵀ(w + v),    (15)

which is the original regularization term added to a sum of coordinate-wise upper bounds ψ_i(v) : R^m → R of the data fidelity term, where n is the number of vertices in the mesh. The component-wise bounds have the form

ψ_i(v) = g_i² + 2 g_i h_i v + 2 ‖h_i‖² ‖v‖² + 2 g_i λ̄_i ‖v‖² + 2 σ_i² ‖v‖⁴,    (16)

where g_i := B_i w + wᵀD^(i)w − b_i and h_i := B_i + 2 wᵀD^(i) are introduced to simplify the notation. Here, D^(i) ∈ R^{m×m} is a symmetric (and sparse) matrix whose nonzero entries are extracted from the corrective blendshapes as D^(i)_{jk} = D^(i)_{kj} = (1/2) ∆b^(i)_{{j,k}} for (j, k) ∈ P; σ_i denotes the largest singular value of D^(i); and λ̄_i equals λ_max(D^(i)) if g_i ≥ 0 and λ_min(D^(i)) otherwise, where λ_min(D^(i)) represents the smallest and λ_max(D^(i)) the largest eigenvalue of D^(i). Under the surrogate function (15), the problem (12) can be solved analytically component-wise, where for each component j = 1, ..., m (with the remaining components of v fixed at zero) the objective reduces to a scalar quartic polynomial

q_j v_j + r v_j² + s v_j⁴.    (17)

The expressions for the polynomial coefficients q, r, s are

q_j = α + 2 Σ_{i=1}^{3n} g_i h_{ij},  r = 2 Σ_{i=1}^{3n} (‖h_i‖² + g_i λ̄_i),  s = 2 Σ_{i=1}^{3n} σ_i².    (18)

Notice that the coefficient q_j depends on the coordinate j, so it has to be computed for each controller separately, while r and s are calculated only once per iteration. We can find the extreme values of the polynomial using the real roots of the cubic derivative q_j + 2 r v_j + 4 s v_j³ = 0 and, if they lie within the feasible interval (i.e., such that 0 ≤ w_j + v_j ≤ 1), compare them with the polynomial values at the interval borders to get the constrained minimizer. The idea of component-wise optimization of the weight vector makes our approach somewhat similar to that of Seol et al.
[2011] considers only a single run over the coordinates, while for us this is only a single step, which we refer to as an inner iteration. In this sense, we start with an arbitrary feasible vector and repeat the inner iteration multiple times to provide an increasingly good estimate of the solution to (9), in a manner similar to trust-region methods Tarzanagh et al.
[2015], Porcelli [2013]. A complete inner iteration is summarized in Algorithm 1. In the following lines, we detail the derivation of the upper bound given in (15).

Derivation of the Upper Bound. If we write the data fidelity term from (9) as a sum and consider each element i = 1, ..., 3n of the sum separately, we can represent it in a canonical quadratic form

φ_i(w) := (B_i w + wᵀD^(i)w − b_i)²,    (19)

which is a single coordinate of the data fidelity term. When we add the increment vector v on top of the current weight vector w, this yields

φ_i(w + v) = (g_i + h_i v + vᵀD^(i)v)² = g_i² + 2 g_i h_i v + (h_i v)² + 2 g_i vᵀD^(i)v + 2 (h_i v)(vᵀD^(i)v) + (vᵀD^(i)v)².    (20)

The data fidelity term is a sum of the functions φ_i(w); hence, in order to bound the objective, we bound each element of the sum, φ_i(w + v) ≤ ψ_i(v). Let us first consider the nonlinear terms of (20) and bound each of them separately. A bound on the quadratic term 2 g_i vᵀD^(i)v depends on the sign of g_i:

2 g_i vᵀD^(i)v ≤ 2 g_i λ̄_i ‖v‖²,  where λ̄_i = λ_max(D^(i)) if g_i ≥ 0 and λ̄_i = λ_min(D^(i)) otherwise.    (21)

The bounds on the remaining quadratic, cubic, and quartic terms are obtained by applying the Cauchy-Schwarz inequality three times:

(h_i v)² ≤ ‖h_i‖² ‖v‖²,  (vᵀD^(i)v)² ≤ σ_i² ‖v‖⁴,  2 (h_i v)(vᵀD^(i)v) ≤ (h_i v)² + (vᵀD^(i)v)² ≤ ‖h_i‖² ‖v‖² + σ_i² ‖v‖⁴.    (22)

The component-wise bound ψ_i(v) is then given by (16), and the complete upper bound is obtained by summing the component-wise bounds and adding the regularization term (Eq. 15).
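The scalar quartic minimization used in each component update (finding the real roots of the cubic derivative and comparing them with the interval endpoints) can be sketched as follows; `minimize_quartic` is our illustrative helper, not code from the paper.

```python
import numpy as np

# Minimize p(v) = q*v + r*v**2 + s*v**4 over an interval [lo, hi], using the
# real roots of the cubic derivative q + 2*r*v + 4*s*v**3 = 0 and the
# interval endpoints as candidate minimizers.

def minimize_quartic(q, r, s, lo=0.0, hi=1.0):
    p = lambda v: q * v + r * v ** 2 + s * v ** 4
    cands = [lo, hi]
    if s != 0.0:
        for z in np.roots([4.0 * s, 0.0, 2.0 * r, q]):   # coefficients, high degree first
            if abs(z.imag) < 1e-10 and lo <= z.real <= hi:
                cands.append(z.real)
    elif r != 0.0:
        z = -q / (2.0 * r)                               # quadratic fallback when s = 0
        if lo <= z <= hi:
            cands.append(z)
    return min(cands, key=p)
```

Since the candidate set contains every interior stationary point plus the borders, the returned value is the global constrained minimizer of the quartic on the interval.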
Algorithm 1 Inner iteration
Compute the coefficients r and s from Eq. (18).
for j = 1, ..., m do
  Compute the coefficient q_j from Eq. (18) and solve the scalar problem (17) for the optimal increment v_j.
end for
Return v = [v_1, ..., v_m]ᵀ.

Feasibility. Let us now state another proposition showing that the above-derived upper bound function satisfies the assumptions of Proposition 1.

Proposition 2. The upper bound function ψ(v) defined in (15)-(16) satisfies Conditions 1 and 2.
Proof. In (20)-(22), the function ψ(·) is derived so that it bounds the objective function (19) from above; hence it satisfies Condition 1 by construction. Now recall that the data fidelity term is a sum of the functions φ_i(·). To prove that Condition 2 is satisfied, it suffices to show that φ_i(w + 0) = ψ_i(0) for each i. From (20) we see that φ_i(w + 0) = g_i², and from (16) we get ψ_i(0) = g_i², which proves it. Hence, the derived bound satisfies the assumptions of Proposition 1.
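The majorization property proved above can also be checked numerically. The sketch below evaluates the component-wise bound of (16), with the per-term constants following the reconstruction in (20)-(22), on random data; `D` plays the role of a single D^(i).

```python
import numpy as np

# Numerical sanity check: the surrogate psi matches phi at v = 0 and
# dominates it for random feasible increments.
rng = np.random.default_rng(4)
m = 6
A = rng.standard_normal((m, m))
D = (A + A.T) / 2.0                    # symmetric corrective matrix
Bi = rng.standard_normal(m)            # i-th row of the blendshape matrix
bi = rng.standard_normal()             # i-th coordinate of the target mesh
w = rng.uniform(0.0, 1.0, m)

g = Bi @ w + w @ D @ w - bi
h = Bi + 2.0 * (D @ w)
eig = np.linalg.eigvalsh(D)            # eigenvalues in ascending order
lam_bar = eig[-1] if g >= 0 else eig[0]
sigma = np.max(np.abs(eig))            # largest singular value (D is symmetric)

def phi(v):
    u = w + v
    return (Bi @ u + u @ D @ u - bi) ** 2

def psi(v):
    nv2 = v @ v
    return (g ** 2 + 2.0 * g * (h @ v) + 2.0 * (h @ h) * nv2
            + 2.0 * g * lam_bar * nv2 + 2.0 * sigma ** 2 * nv2 ** 2)

ok = all(psi(v) >= phi(v) - 1e-9
         for v in (rng.uniform(0.0, 1.0, m) - w for _ in range(1000)))
```

Each sampled increment keeps w + v inside the box [0, 1]^m, matching the feasible set of (12).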
Complete Algorithm. As mentioned above, our solution is iterative and based on applying Algorithm 1 until convergence. In each iteration t, the weight vector is updated by adding the estimated increment vector: w^(t+1) = w^(t) + v. The algorithm terminates either when a given maximum number of iterations is reached, or when the difference between the cost function values in two consecutive iterations falls below a given threshold ε > 0. While any feasible vector 0 ≤ w ≤ 1 can be used for initialization, we can rely on domain knowledge to choose a good starting point leading to faster convergence; these strategies are discussed in the companion paper Racković et al. [2023]. The complete method is summarized in Algorithm 2. Notice that one of the steps of the algorithm computes the eigen- and singular values of the matrices D^(i). This computation is needed only once per animated character, and we can reuse the calculated values for each subsequent frame that is to be fitted.
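An end-to-end sketch of the complete method on a toy quadratic rig is given below. All names and data are illustrative, and we make one deliberate deviation: the quartic coefficient s is additionally scaled by m (valid since ‖v‖⁴ ≤ m Σ_j v_j⁴), so that the separable per-coordinate surrogate is a certified majorizer in this sketch and descent is guaranteed; this scaling is our implementation choice, not a step prescribed by the paper.

```python
import numpy as np

# Toy instance of the complete MM solver for the inverse rig problem (8).
rng = np.random.default_rng(5)
n, m = 8, 5
b0 = rng.standard_normal(3 * n)
B = rng.standard_normal((3 * n, m))
P = [(0, 1), (3, 4)]
db_corr = {p: rng.standard_normal(3 * n) for p in P}
alpha = 0.05

# D^(i) matrices stacked on the first axis: D[i, j, k] = db_corr[(j, k)][i] / 2.
D = np.zeros((3 * n, m, m))
for (j, k), db in db_corr.items():
    D[:, j, k] = D[:, k, j] = db / 2.0
eig = np.array([np.linalg.eigvalsh(D[i]) for i in range(3 * n)])   # once per character
lam_min, lam_max = eig[:, 0], eig[:, -1]
sigma = np.max(np.abs(eig), axis=1)

def f_quad(w):
    return b0 + B @ w + np.einsum('ijk,j,k->i', D, w, w)

def objective(w, b):
    return float(np.sum((f_quad(w) - b) ** 2) + alpha * np.sum(w))

def quartic_min(q, r, s, lo, hi):
    p = lambda v: q * v + r * v ** 2 + s * v ** 4
    cands = [lo, hi] + [z.real for z in np.roots([4.0 * s, 0.0, 2.0 * r, q])
                        if abs(z.imag) < 1e-10 and lo < z.real < hi]
    return min(cands, key=p)

b = f_quad(rng.uniform(0.0, 1.0, m))   # target produced by hidden weights
w = np.full(m, 0.5)                    # feasible initialization
history = [objective(w, b)]
for t in range(50):
    g = f_quad(w) - b                              # residuals g_i
    H = B + 2.0 * np.einsum('ijk,k->ij', D, w)     # rows h_i
    lam = np.where(g >= 0, lam_max, lam_min)
    q = alpha + 2.0 * (g @ H)                      # one q_j per coordinate
    r = 2.0 * np.sum(np.sum(H ** 2, axis=1) + g * lam)
    s = 2.0 * m * np.sum(sigma ** 2)               # extra factor m: our choice
    v = np.array([quartic_min(q[j], r, s, -w[j], 1.0 - w[j]) for j in range(m)])
    w = w + v
    history.append(objective(w, b))
    if history[-2] - history[-1] < 1e-10:          # convergence threshold
        break
```

The spectral quantities are computed once before the loop, mirroring the per-character precomputation noted above, and the per-coordinate solves inside each iteration are independent and thus parallelizable.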
Corollary 1.The estimate sequence w (t) produced by Algorithm 2 is feasible for problem (8) at all iterations t ∈ N, as long as the initial weight vector w (0) is feasible.
Algorithm 2 Complete method
For each coordinate i = 1, ..., 3n, compose the matrix D^(i) ∈ R^{m×m} from the corrective terms and extract its singular and eigenvalues (σ_i, λ_min(D^(i)), λ_max(D^(i))):
for i = 1, ..., 3n do
  for (j, k) ∈ P do
    D^(i)_{jk} = D^(i)_{kj} = (1/2) ∆b^(i)_{{j,k}}
  end for
end for
Repeat the inner iteration (Algorithm 1) until convergence:
for t = 0, ..., T do
  Compute the optimal increment v using Algorithm 1.
  Update the weight vector: w^(t+1) = w^(t) + v.
end for

Results
The experiments in this section are performed on an animated avatar, Omar, available in MetaHuman Creator. The character consists of m = 130 base blendshapes and n = 4000 vertices in the face. The regularization parameter α = 5 is estimated experimentally from the training data, to give a good trade-off between high accuracy of the mesh fit and low cardinality of the weight vector. In our experiments, the weight vector is initialized by (6), with the weights outside of the feasible set clipped to 0 or 1, and then further optimized using our algorithm. While any feasible vector can be used for initialization, this choice has shown quick convergence and a precise mesh fit.
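The initialization described above (the closed-form linear estimate, clipped coordinate-wise onto the feasible box) amounts to two lines; the data below is synthetic.

```python
import numpy as np

# Initialization: closed-form linear estimate (6), clipped to [0, 1].
rng = np.random.default_rng(6)
n, m = 40, 8
B = rng.standard_normal((3 * n, m))
b = B @ rng.uniform(0.0, 1.0, m) + 0.01 * rng.standard_normal(3 * n)
alpha = 5.0

w0 = np.linalg.solve(B.T @ B + alpha * np.eye(m), B.T @ b)
w0 = np.clip(w0, 0.0, 1.0)             # infeasible weights clipped to 0 or 1
```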
Let us first examine the relationship between the objective function (8) and its corresponding surrogate function (15). Figure 1 shows several example frames and the behavior of the two functions in the initial iterations. The blue cross represents the value of the function at an initial iterate w^(t), i.e., for v = 0 (as stated in Condition 2, the two functions have the same value at this point). The values of the two functions are then denoted by the corresponding colored crosses at the estimated iterate w^(t) + v^(t). The other values of the two curves are obtained by interpolating between the two vectors w^(t) and w^(t) + v^(t) and evaluating the corresponding function values. These results clearly show how minimizing the upper bound leads to a decrease in the objective.
Further, we show the final results of the method in Figure 2. On the left side of the subfigures, a barplot represents the estimated activation weights in the range [0, 1]; the vector is relatively sparse, and most of the weights have low values. The average cardinality of the weight vector, over 500 test frames, is 85. Below it, we show the behavior of the objective function over the iterations of the algorithm. Green crosses represent the value of the function (8) at the estimated iterate w^(t) + v^(t), while red pluses denote the value of the corresponding upper bound function (15), in line with the results presented in Figure 1. The plot zooms in on the initial iterations to show that the proposed bound is tight and that the gap between the surrogate and the objective quickly decreases. We can also observe a steady decrease in the objective, confirming the above-stated monotonicity and convergence properties.
Finally, on the right side of the subfigures, we show the reference mesh b in gray, versus the reconstruction obtained by the proposed method. The reconstructed mesh is colored so that red tones indicate a higher error, according to the color bar on the right. In each case, the reconstruction is very close to the original frame, and even the regions indicated in red are not visually different. The average root mean squared error over 500 test frames is 0.09.
For a detailed numerical discussion and comparison with other methods, see Racković et al. [2023]; that reference examines results over multiple animated characters and shows that the proposed method produces accurate and sparse solutions to the inverse rig problem.

Conclusion
This paper introduced the first model-based method for solving the inverse rig problem under the quadratic blendshape function. The proposed algorithm is based on the Majorization-Minimization paradigm: a specific surrogate function is derived, and we provide guarantees that it leads to a non-increasing cost sequence for the original non-convex optimization problem (8).
The algorithm targets high-quality facial animation for the video game and movie industry, and hence assumes that mesh fidelity is more important than computation speed. For this reason, we rely on the quadratic blendshape function, which is more precise than the standard linear one, and strictly constrain the weights to the [0, 1] interval to avoid exaggerated or non-credible expressions.
While the purpose of this paper was to give a complete derivation of the proposed algorithm, Racković et al. [2023] presents an in-depth discussion of the applications and results.
Figure 1: Several example frames showing the relationship between the objective function (green) and the proposed surrogate function (red) at iteration t. The current iterate w^(t) is marked by a blue cross; the estimated next iterate w^(t) + v^(t) is marked by a red cross showing the value of the surrogate function and by a green cross indicating the actual value of the objective function.