Problem setting and search variables
Our goal is to create a natural-looking transition from a source motion \(\mathbf {M}_\text {src}\) to a destination motion \(\mathbf {M}_\text {dst}\). This transition should be responsive: the user triggers it at an arbitrary frame \(t_\text {trigger}\) (e.g., via a gamepad), and the destination motion should then be played as quickly as possible while maintaining naturalness. We assume that a set of n motion clips \(\mathcal {M}= \{ \mathbf {M}_1, \ldots , \mathbf {M}_n \}\) (including \(\mathbf {M}_\text {src}\) and \(\mathbf {M}_\text {dst}\)) is in memory at runtime, so that the clips are available as candidates for hop motion clips.
We consider the case in which the temporal alignment between the source and destination motion clips is hard-constrained and cannot be altered by the transition calculation process. Satisfying this constraint is necessary in many practical scenarios. For example, in a fighting action game, a character might need to start playing a jump motion clip in response to player input to avoid an opponent’s kick; in typical game designs, it is not acceptable for the transition algorithm to alter the temporal alignment offset of the jump motion clip relative to the triggered frame each time. Another example is a character whose movement must be synchronized with some external factor (e.g., background music for a dancing character) in both the source and destination states.
In precomputation, we search for optimal one-hop transitions for all possible triggered frames. Figure 4 illustrates the problem setting of the search for a certain triggered frame \(t_\text {trigger}\), which involves the following unknowns:
- The index of the hop motion clip, \(i \in \{ 1, \ldots , n \}\).
- The temporal alignment offset of the hop motion clip relative to the source motion clip, \(o_{\text {s}\rightarrow \text {h}}\in \mathbb {Z}\).
- The first stay duration (i.e., the number of frames to wait before starting the transition from the source motion clip to the hop motion clip), \(s_\text {s} \in \mathbb {Z}_{\ge 0}\).
- The second stay duration (i.e., the number of frames to wait before starting the transition from the hop motion clip to the destination motion clip), \(s_\text {h} \in \mathbb {Z}_{\ge 0}\).
- The blend duration for the first transition, from the source motion clip to the hop motion clip, \(b_{\text {s}\rightarrow \text {h}}\in \mathbb {Z}_{> 0}\).
- The blend duration for the second transition, from the hop motion clip to the destination motion clip, \(b_{\text {h}\rightarrow \text {d}}\in \mathbb {Z}_{> 0}\).
Among these variables, we assume for simplicity that the blend durations \(b_{\text {s}\rightarrow \text {h}}\) and \(b_{\text {h}\rightarrow \text {d}}\) are given, and we thus omit them from the search variables, because optimizing them is a nontrivial problem. Some advanced techniques can find optimal blend durations (e.g., [37]), but we instead assign them a fixed value of 30 frames (0.5 s in our implementation, selected following [31, 33]).
As a result, the search variables in our framework are expressed as follows. Let \(\mathcal {T}_t\) be a set of all the possible one-hop transitions for a triggered frame t. We represent a one-hop transition, \(T_t \in \mathcal {T}_t\), by a four-element tuple:
$$\begin{aligned} T_t = \left( i, o_{\text {s}\rightarrow \text {h}}, s_\text {s}, s_\text {h} \right) . \end{aligned}$$
(1)
Given a target triggered frame t, our framework seeks to find an optimal transition \(T^{*}_t \in \mathcal {T}_t\) (i.e., the best possible combination of i, \(o_{\text {s}\rightarrow \text {h}}\), \(s_\text {s}\), and \(s_\text {h}\)) that minimizes the duration necessary for the overall transition while maximizing the overall naturalness; this constitutes a trade-off.
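The four-element tuple of Eq. 1 can be encoded as a plain record of four integers. The following minimal Python sketch illustrates one possible encoding; the type and field names (e.g., `OneHopTransition`, `clip_index`) are our own illustrative choices, not part of the framework.

```python
from typing import NamedTuple

# Illustrative encoding of a one-hop transition T_t = (i, o_{s->h}, s_s, s_h).
class OneHopTransition(NamedTuple):
    clip_index: int     # i: index of the hop motion clip in {1, ..., n}
    offset_s_to_h: int  # o_{s->h}: temporal alignment offset of the hop clip
    stay_src: int       # s_s: frames to wait before the first transition (>= 0)
    stay_hop: int       # s_h: frames to wait before the second transition (>= 0)

# Example: hop via clip 3, offset -12 frames, wait 5 frames, then 0 frames.
T = OneHopTransition(clip_index=3, offset_s_to_h=-12, stay_src=5, stay_hop=0)
```

Because a `NamedTuple` is an ordinary tuple, such records compare and store exactly like the four-integer tuples used in the lookup table later.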
Search for optimal one-hop motion transition
Search objective We solve the following discrete search problem for each possible triggered frame t:
$$\begin{aligned} T_t^{*} = \mathop {{\mathrm {arg}}~{\mathrm {min}}}\limits _{T_t \in \mathcal {T}_t} \{ S(T_t) + w R(T_t) \}, \end{aligned}$$
(2)
where \(S: \mathcal {T}\rightarrow \mathbb {R}_{\ge 0}\) is a smoothness cost function, \(R: \mathcal {T}\rightarrow \mathbb {R}_{\ge 0}\) is a responsiveness cost function, and \(w \in \mathbb {R}_{> 0}\) is a weight parameter for the user to control the trade-off between the transition smoothness and responsiveness, depending on the usage scenario. The smoothness cost function is defined as
$$\begin{aligned} S(T_t) = D_{\text {s}\rightarrow \text {h}}(T_t) + D_{\text {h}\rightarrow \text {d}}(T_t), \end{aligned}$$
(3)
where \(D_{\text {s}\rightarrow \text {h}}\) and \(D_{\text {h}\rightarrow \text {d}}\) are distance (or dissimilarity) functions that measure the distances between the source and hop motion clips and the hop and destination motion clips, respectively, during individual transitions. The responsiveness cost function should be a monotonically increasing function with respect to the duration of the overall transition. Our current implementation simply defines it as the sum of the stay durations:
$$\begin{aligned} R(T_t) = s_\text {s} + s_\text {h}. \end{aligned}$$
(4)
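Eqs. 2 through 4 can be combined into a single objective evaluation, as in this hedged sketch. The distance functions `D_sh` and `D_hd` are placeholders for an actual motion-distance metric; a transition is represented as the tuple \((i, o_{\text {s}\rightarrow \text {h}}, s_\text {s}, s_\text {h})\) of Eq. 1.

```python
# Sketch of the search objective. D_sh and D_hd stand in for the distance
# functions D_{s->h} and D_{h->d}; w is the smoothness/responsiveness weight.
def smoothness_cost(T, D_sh, D_hd):
    # Eq. 3: sum of the two per-transition distances.
    return D_sh(T) + D_hd(T)

def responsiveness_cost(T):
    # Eq. 4: sum of the two stay durations.
    i, o_sh, s_s, s_h = T
    return s_s + s_h

def objective(T, D_sh, D_hd, w):
    # The bracketed term minimized in Eq. 2.
    return smoothness_cost(T, D_sh, D_hd) + w * responsiveness_cost(T)
```

With any concrete distance functions plugged in, minimizing `objective` over all candidate tuples realizes Eq. 2.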
Choice of distance function Our current implementation uses a distance function proposed by Kovar et al. [16], which has been used in many follow-up works [6, 15, 17]. It spatially aligns the two motion fragments to be blended by finding an optimal 2D rigid transformation (i.e., horizontal translation and yaw rotation) in the least-squares sense, and then calculates the sum of the distances between corresponding points on the character’s body for every pair of frames. This metric is coordinate-invariant and implicitly incorporates derivative information by considering multiple frames rather than a single frame. We do not repeat the specific equations here; refer to the original paper for details. Figure 5 shows an example of computed distance function values for a pair of motion clips. We emphasize that the choice of distance function is not our focus and that our framework is orthogonal to it; our framework can use any other distance function, such as those considering joint orientations and velocities [23, 37], joint accelerations [1], and so on.
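To illustrate the alignment step, the following sketch fits a least-squares 2D rigid transform (yaw rotation plus horizontal translation) between two point sets via the standard closed-form 2D Procrustes solution and sums the residual distances. This is a simplified stand-in, not Kovar et al.'s exact formulation, which uses weighted point clouds gathered over multiple frames [16].

```python
import math

def rigid_align_distance(pts_a, pts_b):
    # pts_a, pts_b: equal-length lists of corresponding 2D points (x, y).
    n = len(pts_a)
    ax = sum(p[0] for p in pts_a) / n; ay = sum(p[1] for p in pts_a) / n
    bx = sum(p[0] for p in pts_b) / n; by = sum(p[1] for p in pts_b) / n
    # Closed-form optimal rotation angle from centered coordinates.
    num = sum((p[0] - ax) * (q[1] - by) - (p[1] - ay) * (q[0] - bx)
              for p, q in zip(pts_a, pts_b))
    den = sum((p[0] - ax) * (q[0] - bx) + (p[1] - ay) * (q[1] - by)
              for p, q in zip(pts_a, pts_b))
    theta = math.atan2(num, den)
    c, s = math.cos(theta), math.sin(theta)
    total = 0.0
    for (px, py), (qx, qy) in zip(pts_a, pts_b):
        # Rotate a's centered point, translate to b's centroid, measure residual.
        rx = c * (px - ax) - s * (py - ay) + bx
        ry = s * (px - ax) + c * (py - ay) + by
        total += math.hypot(rx - qx, ry - qy)
    return total
```

For two fragments related exactly by a horizontal translation and a yaw rotation, this residual is zero; any deviation contributes to the distance, which is why the metric is coordinate-invariant.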
Search strategy We solve the search problem by using a simple exhaustive strategy with pruning. If at least one of the three values \(D_{\text {s}\rightarrow \text {h}}(T_t)\), \(D_{\text {h}\rightarrow \text {d}}(T_t)\), and \(w R(T_t)\) is larger than the objective value of the current best solution, then that \(T_t\) cannot be the optimal solution (because these three values are all nonnegative), and thus we can prune that candidate without calculating the remaining values. We calculate \(w R(T_t)\) first and the other two afterward, as \(w R(T_t)\) can be calculated more efficiently than the others. Also, we visit smaller \(s_\text {s}\) and \(s_\text {h}\) first and then increment these values; because R is monotonically increasing with respect to both \(s_\text {s}\) and \(s_\text {h}\), doing so is likely to find low-cost solutions at early stages of the search, which makes the pruning more effective.
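The pruning order described above can be sketched as follows. The candidate enumeration and the distance functions `D_sh` and `D_hd` are placeholders; transitions are tuples \((i, o_{\text {s}\rightarrow \text {h}}, s_\text {s}, s_\text {h})\), and `candidates` is assumed to be ordered so that small stay durations come first.

```python
def search_optimal_transition(candidates, D_sh, D_hd, w):
    # Exhaustive search with pruning: evaluate the cheapest term first and
    # skip a candidate as soon as a partial sum already exceeds the best cost.
    best, best_cost = None, float("inf")
    for T in candidates:                 # visit small s_s, s_h first
        i, o_sh, s_s, s_h = T
        r = w * (s_s + s_h)              # w * R(T): cheapest to compute
        if r >= best_cost:
            continue                     # prune: remaining terms are nonnegative
        d1 = D_sh(T)                     # D_{s->h}(T)
        if r + d1 >= best_cost:
            continue                     # prune before the last distance call
        cost = r + d1 + D_hd(T)          # full objective of Eq. 2
        if cost < best_cost:
            best, best_cost = T, cost
    return best, best_cost
```

Because the partial sums only grow, every pruned candidate is provably no better than the current best, so the result equals that of a full exhaustive scan.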
Storage and use of precomputed results Precomputed optimal transitions can be stored in a lookup table in which triggered frames are the keys for retrieval. By implementing the lookup table as a random-access array, retrieval runs in constant time (O(1)), which is sufficiently fast for computational-resource-restricted scenarios such as video games. Furthermore, as a one-hop transition can be represented by four integers (Eq. 1), the runtime module is not memory-intensive. For example, when the state machine has 50 possible state transitions and every motion clip has \(1 \mathrm {k}\) frames, the runtime module uses \(800~{\mathrm {kB}}\) (\(= 50 \text { [state transitions]} \times 1 \mathrm {k} \text { [frames]} \times 4 \text { [integers]} \times 4 \text { [bytes]}\)), which is small compared to the memory used by typical assets such as textures.
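Such a table can be realized, for instance, as a flat byte buffer of packed four-integer records indexed by triggered frame. This is a hedged sketch of one possible layout, not the paper's actual implementation.

```python
import struct

# One record per triggered frame: (i, o_{s->h}, s_s, s_h) as 32-bit integers.
RECORD = struct.Struct("<4i")  # 16 bytes per record

def pack_table(transitions):
    # transitions[t] is the optimal transition for triggered frame t.
    return b"".join(RECORD.pack(*T) for T in transitions)

def lookup(table, t):
    # O(1) retrieval: index directly into the packed array.
    return RECORD.unpack_from(table, t * RECORD.size)
```

At 16 bytes per record, 50 state transitions with 1k frames each come to 50 × 1000 × 16 B = 800 kB, matching the figure above.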
Discussion By selecting the destination motion clip as the hop motion clip and setting \(o_{\text {s}\rightarrow \text {h}}= o_{\text {s}\rightarrow \text {d}}\) and \(s_\text {s} = s_\text {h} = 0\), which is included in the search space, our one-hop transition can express the same transition as the direct approach. Therefore, our framework is guaranteed to find a transition that is equal to or better than that of the direct approach in terms of the search objective.
Default weight
Though it is important to let users adjust the weight parameter w in Eq. 2, it is helpful to provide a “default” value. We present a simple algorithm to find such a value. First, from the set of target motion clips, we randomly sample N transition conditions, each consisting of source and destination motion clips, their temporal alignment, and a triggered frame. Then, we search for optimal transitions for these conditions without using stay durations (i.e., we set \(s_\text {s} = s_\text {h} = 0\)) and record the average smoothness cost value \(\bar{S}_{0}\). Next, we perform the search again, but incorporating the stay durations as search variables with an upper bound \(s_\text {max} \in \mathbb {Z}_{>0}\) and with \(w = 0\), and we record the average smoothness cost value \(\bar{S}_{s_\text {max}}\). Finally, we obtain the default weight as
$$\begin{aligned} w_\text {default} = \frac{{\bar{S}}_{0} - {\bar{S}}_{s_\text {max}}}{2 s_\text {max}}. \end{aligned}$$
(5)
Intuitively, this algorithm measures how much the smoothness cost decreases when using stay durations as compared to when not using them, and then it normalizes the value to find a reasonable relative scale between the smoothness and responsiveness costs. We use \(2 s_\text {max}\) as the denominator, because it is the upper bound of the responsiveness cost when not using stay durations. Another option is to use the average of the responsiveness costs. Either way is valid, because the smoothness and responsiveness costs become comparable through multiplication by the weight. Unless otherwise stated, the results presented in this paper were generated with this default weight automatically set (\(N = 100, s_\text {max} = 60\)).
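The heuristic of Eq. 5 can be sketched as below. The two search routines are placeholders for the optimizer described earlier: one returns the smoothness cost of the best transition without stay durations, the other with stay durations bounded by `s_max` and \(w = 0\).

```python
def default_weight(conditions, smoothness_no_stays, smoothness_with_stays, s_max):
    # conditions: sampled transition conditions (source/destination clips,
    # alignment, triggered frame). The two callbacks are placeholder searches.
    n = len(conditions)
    S_bar_0 = sum(smoothness_no_stays(c) for c in conditions) / n
    S_bar_max = sum(smoothness_with_stays(c, s_max) for c in conditions) / n
    # Eq. 5: smoothness gain from stays, normalized by the maximum
    # responsiveness cost 2 * s_max incurred to obtain it.
    return (S_bar_0 - S_bar_max) / (2 * s_max)
```

With the paper's settings (N = 100, s_max = 60), this yields a weight at which one frame of added stay duration "costs" about as much as the average smoothness it buys.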