1 Introduction

Motion capture technology has been widely used to create natural human animations in real-time live applications such as virtual training, virtual prototyping, computer games and computer-animated puppetry [1]. Passive optical motion capture systems, such as VICON [2], are used in most applications because of their high precision and low intrusiveness. However, such a system only records the 3D positions of markers without any physical meaning (i.e., the markers are unlabeled). In addition, markers often disappear and re-appear during a motion sequence due to occlusion by limbs or self-occlusion, which makes marker labeling for a live motion capture sequence a major challenge.

The goal of a practical marker labeling method is to (1) solve the correspondence problem for moving markers while (2) handling missing and/or ghost markers, which lead to ambiguities in motion reconstruction. Unlike offline marker labeling methods [3], we mainly aim to achieve the second goal, especially when both accuracy and efficiency need to be considered in real-time live applications [4] and interactive applications [5].

In this paper, we present a novel online marker labeling approach based on a graph matching model and a human pose reconstruction process to produce accurate and efficient marker labeling results for real-time live applications with missing/ghost markers, as illustrated in Fig. 1. Specifically, by regarding the labeled markers at the previous frame and the unlabeled markers at the current frame as the model graph and the data graph, respectively, we formulate marker labeling as soft graph matching, a combinatorial optimization problem that we solve with the Hungarian algorithm to achieve high efficiency. To achieve high labeling accuracy, we also design a nonlinear optimization process to estimate the positions of missing markers.

Fig. 1 Online marker labeling process overview. Given previously labeled marker data (left three columns), our approach automatically labels the raw marker data (top right image) captured by motion capture system

We demonstrate the power of our approach by comparing it against alternative state-of-the-art methods and a commercial system, VICON, on a wide range of motion capture data with missing/ghost markers. First, we show the superior accuracy and the efficiency of our method on motion sequences of a single subject and of two interacting subjects (Sect. 5.1). Then, we show its superior pose reconstruction accuracy on single-subject motion sequences (Sect. 5.2). Finally, we show accurate marker labeling results that demonstrate its capability of handling ghost markers as well as facial motions, which have no rigid constraints (Sect. 5.3). Due to the page limit, we refer the reader to the supplementary video for more evaluation results. Note that, since we focus only on correctly solving the marker labeling problem and not on motion denoising [4] or missing marker estimation [6, 7], we only show evaluations against alternative marker labeling methods.

In summary, our main contributions are as follows: (1) a novel, accurate and efficient marker labeling process for real-time live applications; (2) a soft graph matching model that automatically labels the markers in successive frames by using the Hungarian algorithm to find the globally optimal matching.

2 Related work

Our online marker labeling method is related to point correspondence and graph matching methods.

2.1 Point correspondence

Yu [8] proposed an online tracking framework for multiple interacting subjects that constructs a motion model to find the best marker correspondences. The method is a greedy algorithm that leads to a local optimum, and it requires at least two visible markers on the same limb. Similar to [8], we also use a tracking framework, but we introduce a soft graph matching model instead of using example data [8] to improve the labeling accuracy.

Li [9] proposed a self-initializing identification labeling method on each segment for establishing local segmental correspondences. Li [10] designed a similarity k-d tree to identify markers from similar poses of two objects, but the method cannot deal with missing data. Li [11] integrated key-frame-based self-initializing hierarchical segmental matching [9] with inter-frame tracking to label articulated motion sequences represented by feature points, which is an offline approach.

Mundermann’s Articulated-ICP algorithm with soft-joint constraints [12] is used to track limbs from dense images. In our case, full-body and facial motions are represented by sparse 3D points, known as markers, attached to the subjects’ skin. When markers are missing, which happens frequently during motion capture, it is almost impossible to use ICP-based methods [13, 14] to find the marker correspondences in successive frames. Others [15, 16] fit lines, curves or surfaces to dense points to obtain non-rigid transformations. The necessary spatial data continuity is again not available in the case of sparse points [11].

Probabilistic inference with point topology is used to find point correspondences for 2D non-rigid points [17, 18] and 3D dense surface points [19, 20]. In contrast, we propose a soft graph matching model, solved with a discrete combinatorial optimization algorithm, to find sparse 3D marker correspondences online.

2.2 Graph matching

Graph matching plays a central role in solving the correspondence problem. Depending on whether the graph edges are taken into account, graph matching can be divided into two categories: unary graph matching and binary graph matching.

Unary graph matching treats each node independently, discarding the relationships between nodes. This model was used by Veenman [21] to solve the point tracking problem. They proposed an adaptable framework that can be used in conjunction with a variety of cost functions [22, 23] and solved with the Hungarian algorithm [24].

Binary graph matching considers both node and edge attributes. The problem is NP-hard, and a lot of effort has been made to find good approximate solutions [25]. Probably the fastest approximate solution is the efficient spectral method presented in [26]. After relaxing the mapping constraints and the integrality constraints, the principal eigenvector of the matching cost matrix is interpreted as the confidence of the assignments. The assignment that has the maximum confidence and is consistent with the constraints is accepted as a correct assignment. However, as stated in [27, 28], the correspondence accuracy of the method is not very satisfactory. Torresani [28] applies a “dual decomposition” approach that decomposes the original problem into simpler sub-problems, which are repeatedly solved independently and combined into a global solution. The authors claim that it is the first technique capable of reaching global optimality on various real-world image matching problems and that it outperforms existing graph matching algorithms. However, their method needs several seconds to process a picture with 30 nodes.

Existing graph matching methods cannot solve our problem with both accuracy and efficiency. Unary graph matching is efficient but inaccurate because it neglects edge constraints. Binary graph matching is more accurate because it takes both motion smoothness and edge constraints into account, but it is too complex to reach the optimal solution in real time. In this paper, we combine the advantages of the unary and binary graph models and present a soft graph matching model that merges the matching cost of the local geometrical structure, which consists of edges, into the matching cost of the graph nodes to achieve high accuracy and efficiency simultaneously.

3 Soft graph matching

We define the labeled marker set at the previous frame as the model graph, represented by \(G_{1}=(V_{1},E_{1})\), and the unlabeled marker set at the current frame as the data graph, represented by \(G_{2}=(V_{2},E_{2})\). \(V_{1}=\{\mathbf {m_{i}}:i=1,\ldots ,M\}\) and \(V_{2}=\{\mathbf {u_{j}}:j=1,\ldots ,N\}\) are the node (marker) sets, and \(E_{1}\) and \(E_{2}\) are the edge sets. \(\mathbf {m_{i}}\) and \(\mathbf {u_{j}}\) are labeled and unlabeled 3D marker positions, respectively. We connect two markers by an edge \(\mathbf {m_{i}}\mathbf {m_{j}}\) if they are neighbors on the same limb and their relative position stays fixed over time; we call this the local rigid constraint.

Fig. 2 An example of the soft graph matching model, consisting of a model graph and a data graph. Markers \({m_{1},\ldots ,m_{4}}\) are nodes in the model graph at the previous frame, and unlabeled markers \({u_{1},\ldots ,u_{7}}\) are nodes in the data graph at the current frame. Markers within circles (dashed circles) are selected as candidate assignments. The involved edge matching cost is shown in Table 1

We assume that the number of markers in \(V_{1}\) and \(V_{2}\) is the same, i.e., \(M=N\). When \(M \ne N\), which is caused by missing markers and ghost markers, we add dummy markers to \(V_{1}\) and \(V_{2}\) so that the condition holds. The matching cost related to the dummy markers is set to the maximum cost \(w_\mathrm{max}\).

Let \(\{\phi _{ij}: i=1,\ldots ,M; j=1,\ldots ,N\}\) denote all possible matches between the model and data graphs, and let \(\{c_{ij}: i=1,\ldots ,M; j=1,\ldots ,N\}\) be their matching costs. Let L denote the correct labeling, which is a set of marker matches. \(x_{ij}\) is an indicator variable that equals 1 if \(\phi _{ij} \in L\) and 0 otherwise.

Our method considers a marker and its local geometrical structure simultaneously. The edges starting from a marker make up its local geometrical structure. Thus, the marker labeling problem can be formulated as the following soft graph matching problem.

$$\begin{aligned} \min _{\mathbf {x}}\;cost(\mathbf {x})=\sum _{a}[\omega _{p}c_{p}(a)x_{a} + (1-\omega _{p})c_{lg}(a)x_{a}] \end{aligned}$$
(1)

where \(c_{p}\) and \(c_{lg}\) are the matching costs of a point and of its local geometrical structure, respectively, \(\omega _{p}\) is a weight, and a is any possible match (i, j). We use the classic Hungarian algorithm to solve this combinatorial optimization problem. In our experiments, we set \(\omega _{p}=0.5\).

Figure 2 explains how to calculate the matching cost of a marker correspondence and of its local geometrical structure. Let \(\phi _{a}\) denote the correspondence of \(\mathbf {u_{1}}\) to \(\mathbf {m_{1}}\). The cost of the point match caused by \(\phi _{a}\) is defined as the following spatial distance:

$$\begin{aligned} c_{p}(a) = \Vert \mathbf {m_{1}}-\mathbf {u_{1}}\Vert ^{2} \end{aligned}$$
(2)

Table 1 The matching cost of edges related to \(\phi _{a}\)

As we can see in Fig. 2, \(\mathbf {m_{1}}\) and \(\mathbf {m_{2}}\) are connected by an edge \(\mathbf {m_{1}m_{2}}\). If \(\mathbf {u_{1}}\) is matched to \(\mathbf {m_{1}}\) and \(\mathbf {u_{3}}\) is matched to \(\mathbf {m_{2}}\), the relative position between \(\mathbf {u_{1}}\) and \(\mathbf {u_{3}}\) must meet the edge constraint of \(\mathbf {m_{1}}\) and \(\mathbf {m_{2}}\). We use \(c_{e}(\mathbf {m_{i}m_{j}},\mathbf {u_{i'}u_{j'}})\) to denote the matching cost of an edge. Let \(\mathbf {m_{j}}\) denote the markers connected with \(\mathbf {m_{1}}\) and let \(\mathbf {u_{j'}}\) be a candidate assignment of \(\mathbf {m_{j}}\). The local geometrical matching cost of \(\phi _{a}\) is then defined as:

$$\begin{aligned} c_{lg}(a) = \frac{1}{|\mathbf {m_{j}}|} \sum _{\mathbf {m_{j}}}\min _{\mathbf {u_{j'}}}c_{e}\left( \mathbf {m_{1}m_{j}},\mathbf {u_{1}u_{j'}}\right) \end{aligned}$$
(3)

where \(|\mathbf {m_{j}}|\) is the number of markers connected with \(\mathbf {m_{1}}\) and \(c_{e}(\mathbf {m_{1}m_{j}},\mathbf {u_{1}u_{j'}})\) is defined as:

$$\begin{aligned} c_{e}(\mathbf {m_{1}m_{j}},\mathbf {u_{1}u_{j'}}) ={}& (\Vert \mathbf {u_{1}} - \mathbf {u_{j'}} \Vert - d_{\mathbf {m_{1}m_{j}}})^{2} \\ &+ \omega _{a} \left( 1 - \frac{(\mathbf {m_{1}} - \mathbf {m_{j}})\cdot (\mathbf {u_{1}} - \mathbf {u_{j'}})}{\Vert \mathbf {m_{1}} - \mathbf {m_{j}}\Vert \Vert \mathbf {u_{1}} - \mathbf {u_{j'}}\Vert }\right) ^{2} \end{aligned}$$
(4)

where \(d_{\mathbf {m_{1}m_{j}}}\) is the distance between \(\mathbf {m_{1}}\) and \(\mathbf {m_{j}}\), which is obtained from the previous frame and updated over time. The 1st term in Eq. 4 measures the difference in length between the two edges, and the 2nd term measures the difference in their directions. The mismatch between the unit of length and the unit of the cosine of the angle is compensated by \(\omega _{a}\). In our experiments, we set \(\omega _{a}=10^{4}\). The matching costs of the different edges are then averaged to form the local geometrical matching cost of \(\phi _{a}\).
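
As an illustration of Eqs. 3 and 4, the following is a minimal Python sketch, not the authors' implementation. The function names, the tuple representation of 3D points, and the handling of zero-length edges are assumptions made for this example; it also assumes that every neighbor has at least one candidate assignment.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _norm(a):
    return math.sqrt(sum(x * x for x in a))

def edge_cost(m1, mj, u1, uj, d_rest, omega_a=1e4):
    """Eq. 4: squared length difference plus weighted squared (1 - cosine)."""
    em, eu = _sub(m1, mj), _sub(u1, uj)
    len_m, len_u = _norm(em), _norm(eu)
    length_term = (len_u - d_rest) ** 2
    if len_m == 0.0 or len_u == 0.0:
        # Assumption: an edge of zero length has no direction; use the worst case.
        cos_angle = -1.0
    else:
        cos_angle = sum(a * b for a, b in zip(em, eu)) / (len_m * len_u)
    return length_term + omega_a * (1.0 - cos_angle) ** 2

def local_geometry_cost(m1, u1, neighbors, candidates, rest_lengths, omega_a=1e4):
    """Eq. 3: for each neighbor m_j keep its cheapest candidate u_j', then average.

    neighbors[k]    -- position of the k-th marker connected with m1
    candidates[k]   -- non-empty list of candidate positions for that neighbor
    rest_lengths[k] -- stored distance d between m1 and that neighbor
    """
    total = 0.0
    for mj, cands, d in zip(neighbors, candidates, rest_lengths):
        total += min(edge_cost(m1, mj, u1, uj, d, omega_a) for uj in cands)
    return total / len(neighbors)
```

With \(\omega _{a}=10^{4}\), a candidate edge that is perpendicular to the model edge costs \(10^{4}\) from the direction term alone, so direction errors dominate unless the length error is large.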

With \(\phi _{a} = \phi _{ij}\), our soft graph matching model is defined as follows:

$$\begin{aligned} \min _{\mathbf {x}}&\quad cost(\mathbf {x})=\sum _{i}\sum _{j}w_{ij}x_{ij}, \quad \text {where} \nonumber \\ w_{ij}&= \left\{ \begin{array}{ll} \omega _{p}c_{p}(a) + (1-\omega _{p})c_{lg}(a) , &{} \; j \in b(i) \\ w_\mathrm{max}, &{} \; j \notin b(i) \end{array} \right. \nonumber \\ \hbox {s.t.}&\quad x_{ij}\in \{0,1\} \nonumber \\&\quad \sum _{i}^{}x_{ij}=1, \text {for all}\, j, \quad \sum _{j}^{}x_{ij} = 1, \text {for all}\,i \end{aligned}$$
(5)

where \(w_\mathrm{max}\) is an experimentally defined maximum cost, and the candidate matches b(i) of marker i are the markers whose distance to the position of marker i predicted by a Kalman filter [30] is less than a specific threshold. We use the Hungarian algorithm to find the best matching, which is very fast because the matching cost is only calculated for the selected candidate assignments of each marker.
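
The matching step of Eq. 5 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the Kalman predictions are taken as given, `cost_fn(i, j)` is a hypothetical callable that returns the combined cost \(\omega _{p}c_{p} + (1-\omega _{p})c_{lg}\), and `hungarian` is a standard \(O(n^{3})\) potentials-based implementation of the Hungarian algorithm written out so that the example is self-contained.

```python
import math

def hungarian(cost):
    """Minimum-cost perfect matching on a square matrix; returns match[row] = column."""
    n = len(cost)
    inf = float("inf")
    u = [0.0] * (n + 1)          # row potentials (1-based)
    v = [0.0] * (n + 1)          # column potentials (1-based)
    p = [0] * (n + 1)            # p[j] = row currently matched to column j
    way = [0] * (n + 1)          # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:           # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = [0] * n
    for j in range(1, n + 1):
        match[p[j] - 1] = j - 1
    return match

def label_markers(predicted, data, cost_fn, radius, w_max):
    """Return labels[i] = index of the data point matched to model marker i, or None."""
    n_model, n_data = len(predicted), len(data)
    # Candidate sets b(i): data points within `radius` of the predicted position.
    cand = [{j for j in range(n_data) if math.dist(predicted[i], data[j]) < radius}
            for i in range(n_model)]
    # Dummy rows/columns make the matrix square; every entry defaults to w_max.
    size = max(n_model, n_data)
    w = [[w_max] * size for _ in range(size)]
    for i in range(n_model):
        for j in cand[i]:
            w[i][j] = cost_fn(i, j)
    match = hungarian(w)
    # A marker matched outside b(i) (dummy or gated-out point) is reported as missing.
    return [match[i] if match[i] in cand[i] else None for i in range(n_model)]
```

In this sketch, data points that no model marker selects (for example ghost markers) are absorbed by dummy rows, and model markers with no candidate are returned as `None` so that they can be passed to a missing marker estimation step. A library routine such as SciPy's `linear_sum_assignment` could replace `hungarian` where SciPy is available.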

As our marker labeling method is for real-time live applications, we use a simple method to automatically label all markers at the 1st frame. Specifically, we first instruct the subjects to start their motions from a T-pose with all markers visible. Then, based on prior knowledge of the current subject’s skeleton T-pose model and the marker offsets relative to the inboard joints, we perform a nonlinear optimization to fit the model to the markers captured at the 1st frame by minimizing the distances between the markers on the model and the captured markers.

4 Missing marker estimation

Raw motion capture data often contain missing markers due to limb occlusions and self-occlusions, which lowers the accuracy of the marker labeling process. Here we propose a nonlinear optimization process to solve the problem. First, we reconstruct the current pose using inverse kinematics. Then, we use the reconstructed pose and the edge constraints to estimate the positions of the occluded markers.

We define the human body pose using a set of independent joint coordinates \(\mathbf {\theta } \in \mathbb {R}^{42}\), including the absolute root position and orientation (6 DoF) as well as the relative joint angles of the individual bones. These bones are the head (1 DoF), neck (2 DoF), lower back (3 DoF), and left/right shoulders (2 DoF), arms (3 DoF), forearms (1 DoF), hands (3 DoF), upper legs (3 DoF), lower legs (1 DoF), and feet (2 DoF).

We reconstruct current frame pose \(\mathbf {\theta ^{t}}\) by minimizing an objective function consisting of four terms:

$$\begin{aligned} \min _{\mathbf {\theta ^{t}}} \quad \lambda _{1}E_{O} + \lambda _{2}E_{P} + \lambda _{3}E_{S} + \lambda _{4}E_{C} \end{aligned}$$
(6)

where \(E_{O}\), \(E_{P}\), \(E_{S}\) and \(E_{C}\) represent the observed term, predicted term, smoothness term and constraint term, respectively. The weights \(\lambda _{1}\), \(\lambda _{2}\), \(\lambda _{3}\) and \(\lambda _{4}\) control the importance of each term and are experimentally set to 0.05, 0.15, 0.8 and 0.1, respectively. We describe each term in detail below.

The observed term measures the distance between the labeled observed markers and corresponding markers from reconstructed pose:

$$\begin{aligned} E_{O}=\sum _{i=1}^{M}[(1-o_{i}^{t})(e_{i}(\mathbf {\theta ^{t}}) - \mathbf {m_{i}^{t}})^{2}] \end{aligned}$$
(7)

where \(e_{i}(\mathbf {\theta ^{t}})\) is the forward kinematics function that computes the position of the ith marker from prior knowledge of the user’s skeleton, \(\mathbf {s_{v}}\), and the markers’ offsets, \(\mathbf {l_{v}}\), relative to the inboard joint. \(o_{i}^{t}\) is a binary weight that equals 1 if the ith marker is occluded and 0 otherwise, so that only visible markers contribute to \(E_{O}\) (Eq. 7) and only occluded markers contribute to \(E_{P}\) (Eq. 10). Reconstructing the motion sequence from only this constraint is the same as performing per-frame inverse kinematics as in [29].

The predicted term is derived from the Kalman filter [30], which gives a probability distribution over the 3D positions of the occluded markers. Suppose that \(x_{i}^{t-1}\) is the hidden state vector, consisting of the 3D position and velocity of marker i, and \(y_{i}^{t}\) is the measurement vector, i.e., the captured position or the estimated position (when the marker is occluded) of the same marker. The reconstructed pose should maximize the conditional distribution \(\mathbf {y_{i}^{t}} | \mathbf {x_{i}^{t-1}}\), which is a normal distribution:

$$\begin{aligned} P(\mathbf {y_{i}^{t}} | \mathbf {x_{i}^{t-1}}) = \frac{\exp \left( -\frac{1}{2}(\mathbf {y_{i}^{t}} - \mathbf {\mu _{i}^{t}}) ^{T} (\varGamma _{i}^{t}) ^{-1} (\mathbf {y_{i}^{t}} - \mathbf {\mu _{i}^{t}})\right) }{(2\pi )^{\frac{d}{2}}|\varGamma _{i}^{t}|^{\frac{1}{2}}} \end{aligned}$$
(8)

with the mean and covariance

$$\begin{aligned} \mathbf {\mu _{i}^{t}}=\mathbf {v_{i}^{t}}, \quad \varGamma _{i}^{t}= H_{i}^{T}H_{i}\varLambda +\varSigma \end{aligned}$$
(9)

where \(\mathbf {v_{i}^{t}}\) is the measurement predicted from the previous state \(\mathbf {x_{i}^{t-1}}\), d is the dimension of \(\mathbf {y_{i}^{t}}\), \(|\varGamma _{i}^{t}|\) is the determinant of the covariance matrix \(\varGamma _{i}^{t}\), \(H_{i}\) is the measurement matrix that relates the hidden state \(x_{i}^{t}\) to the measurement \(y_{i}^{t}\), and \(\varLambda \) and \(\varSigma \) are the process noise covariance and the measurement noise covariance, respectively.

We minimize the negative log of \(P(\mathbf {y_{i}^{t}} | \mathbf {x_{i}^{t-1}})\), yielding the formulation:

$$\begin{aligned} E_{P}=\sum _{i=1}^{M}[o_{i}^{t}(e_{i}(\mathbf {\theta ^{t}}) - \mathbf {v_{i}^{t}})^{T} (\varGamma _{i}^{t}) ^{-1} (e_{i}(\mathbf {\theta ^{t}}) - \mathbf {v_{i}^{t}})] \end{aligned}$$
(10)
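
The step from Eq. 8 to Eq. 10 can be spelled out as follows; this is our reading of the derivation rather than a statement from the original text. Taking the negative logarithm of Eq. 8 gives

$$\begin{aligned} -\ln P(\mathbf {y_{i}^{t}} | \mathbf {x_{i}^{t-1}}) = \frac{1}{2}(\mathbf {y_{i}^{t}} - \mathbf {\mu _{i}^{t}})^{T} (\varGamma _{i}^{t}) ^{-1} (\mathbf {y_{i}^{t}} - \mathbf {\mu _{i}^{t}}) + \frac{d}{2}\ln (2\pi ) + \frac{1}{2}\ln |\varGamma _{i}^{t}| \end{aligned}$$

The last two terms do not depend on the pose \(\mathbf {\theta ^{t}}\) and can be dropped. Substituting the marker position on the reconstructed pose, \(e_{i}(\mathbf {\theta ^{t}})\), for \(\mathbf {y_{i}^{t}}\) and \(\mathbf {v_{i}^{t}}\) for \(\mathbf {\mu _{i}^{t}}\) (Eq. 9), omitting the constant factor \(\frac{1}{2}\), and summing over the markers selected by \(o_{i}^{t}\) yields Eq. 10. The quadratic form is a squared Mahalanobis distance, so deviations along directions in which the prediction is uncertain are penalized less.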

The smoothness term is used to enforce temporal smoothness by penalizing the velocity change between current reconstructed pose \(\mathbf {\theta ^{t}}\) and two previous ones \([\mathbf {\theta ^{t-1}}, \mathbf {\theta ^{t-2}}]\) through time:

$$\begin{aligned} E_{S} = \Vert \mathbf {\theta ^{t}} - 2\mathbf {\theta ^{t-1}} + \mathbf {\theta ^{t-2}}\Vert ^{2} \end{aligned}$$
(11)

The constraint term is used to prevent the pose from reaching an impossible posture by over-bending the joints. We limit the joint angles with the following equation:

$$\begin{aligned} E_{C}=\sum _{\mathbf {\theta _{i}^{t} \in \theta ^{t}}}[\underline{\beta }(i)(\mathbf {\theta _{i}^{t}} -\mathbf {\underline{\theta _{i}}})^2 + \overline{\beta }(i)(\mathbf {\theta _{i}^{t}} -\mathbf {\overline{\theta _{i}}})^2] \end{aligned}$$
(12)

where each body joint is associated with conservative bounds [\(\mathbf {{\underline{\theta _{i}}}, \overline{\theta _{i}}}\)]. For the bounds, we use the values reported in the biomechanical literature [31]. \(\underline{\beta }(i)\) and \(\overline{\beta }(i)\) are indicator functions. \(\underline{\beta }(i)\) evaluates to 1 if \(\mathbf {\theta _{i}^{t}<\underline{\theta _{i}}}\), and to 0 otherwise. \(\overline{\beta }(i)\) evaluates to 1 if \(\mathbf {\theta _{i}^{t}>\overline{\theta _{i}}}\), and to 0 otherwise.
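
The two regularization terms, Eqs. 11 and 12, are simple enough to sketch directly. The Python below is an illustration only; it assumes that a pose is a flat sequence of joint coordinates, and the function and parameter names are ours.

```python
def smoothness_term(theta_t, theta_prev, theta_prev2):
    """Eq. 11: squared norm of the second-order finite difference of the pose."""
    return sum((a - 2.0 * b + c) ** 2
               for a, b, c in zip(theta_t, theta_prev, theta_prev2))

def joint_limit_term(theta_t, lower, upper):
    """Eq. 12: quadratic penalty that is active only outside [lower, upper]."""
    total = 0.0
    for angle, lo, hi in zip(theta_t, lower, upper):
        if angle < lo:        # lower indicator equals 1
            total += (angle - lo) ** 2
        elif angle > hi:      # upper indicator equals 1
            total += (angle - hi) ** 2
    return total
```

A pose that moves with constant velocity over three frames has zero smoothness cost, and a pose within all joint bounds has zero constraint cost, so both terms only act when the motion accelerates or a joint over-bends.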

We use the quasi-Newton BFGS method [32] to solve the optimization problem in Eq. 6. For the 1st frame, we run the pose reconstruction without the smoothness term. In most cases, each frame takes 3–5 iterations to converge.

The pose reconstruction process preserves the motion tendency of the missing markers by maintaining rigid body constraints, so we can estimate the missing markers from the reconstructed pose. Specifically, we assume that the relative position of two markers (\(\mathbf {m_{j}^{t}, m_{i}^{t}}\)) on the same limb, which we call neighbor markers, is fixed at any time during the motion sequence. We can then obtain a missing marker from the reconstructed pose:

$$\begin{aligned} \mathbf {m_{i}^{t}}=\frac{1}{|j|}\sum _{j}[\mathbf {m_{j}^{t}}-(e_{j}(\mathbf {\theta ^{t}}) - e_{i}(\mathbf {\theta ^{t}}))] \end{aligned}$$
(13)

where j ranges over the neighbor markers of i that are visible at the current frame, and |j| is their number.

When one marker and most of its neighbors are missing at the same time, we use an iterative scheme to recover the missing markers. First, we recover the missing markers whose neighbors are visible, and then we use the recovered markers to estimate the other occluded markers. If all markers on the same limb are missing at the same time, we directly use the virtual markers on the reconstructed pose as the recovered ones.
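
The recovery scheme built on Eq. 13 can be sketched as follows. This is a hedged Python illustration, not the authors' code: the dictionary-based data layout and the name `recover_missing` are assumptions, and a marker is treated as recoverable in a pass as soon as at least one of its neighbors is known.

```python
def recover_missing(observed, virtual, neighbors):
    """Fill in missing markers from the reconstructed pose.

    observed  -- dict: marker id -> captured 3D position, or None if missing
    virtual   -- dict: marker id -> e_i(theta^t), position on the reconstructed pose
    neighbors -- dict: marker id -> ids of neighbor markers on the same limb
    """
    recovered = dict(observed)
    pending = {i for i, pos in observed.items() if pos is None}
    while pending:
        ready = {}
        for i in pending:
            known = [j for j in neighbors[i] if recovered[j] is not None]
            if not known:
                continue
            # Eq. 13: average of m_j - (e_j - e_i) over the known neighbors j.
            est = [0.0, 0.0, 0.0]
            for j in known:
                for k in range(3):
                    est[k] += recovered[j][k] - (virtual[j][k] - virtual[i][k])
            ready[i] = tuple(c / len(known) for c in est)
        if not ready:
            # No remaining marker has a known neighbor (e.g. a whole limb is
            # occluded): fall back to the virtual markers on the reconstructed pose.
            for i in pending:
                recovered[i] = virtual[i]
            break
        recovered.update(ready)          # recovered markers serve the next pass
        pending.difference_update(ready)
    return recovered
```

Estimates are committed only at the end of each pass, so the result does not depend on the order in which the missing markers are visited.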

5 Experimental results

We demonstrate the power of our approach by comparing it against alternative state-of-the-art methods and a commercial system, VICON, on a wide range of motion capture data. First, we show the superior accuracy and the efficiency of our method on motion sequences of a single subject and of two interacting subjects (Sect. 5.1). Then, we show its superior pose reconstruction accuracy on single-subject motion sequences (Sect. 5.2). Finally, we show accurate marker labeling results that demonstrate its capability of handling ghost markers as well as facial motions, which have no rigid constraints (Sect. 5.3). All tests are run on a 4-core 2.4 GHz CPU with 2 GB RAM. We use the labeled markers at the 1st frame to initialize the Kalman filters and the relative positions between markers on the same limb. We measure marker labeling accuracy by the identification rate of marker trajectories, \(\zeta \), defined as the ratio of the number of correctly labeled marker trajectories to the total number of trajectories.

Fig. 3 Comparison results: identification rate of marker trajectory (\(\zeta \)) versus embedded noise level (Eq. 14)

Table 2 Efficiency (fps) of different labeling methods

5.1 Performance on CMU motion capture data

We compare our method against the following alternatives: the method of Yu [8] (YLD), the closest-point-based approach (CP), binary graph matching [26] (LH) and the original unary graph matching (UGM). The CP approach assumes that the correct correspondence is the closest point in the next frame. For binary and unary graph matching, the matching costs of point and edge correspondences take the form of Eqs. 2 and 4, respectively. We test on 665 CMU MoCap sequences [33] (816,000 frames in total), including walk, run, jump, kick, punch, roll, dance, skateboard, basketball, etc. The original data are captured at 120 fps. All the data are classified according to the embedded noise level, which is defined as:

$$\begin{aligned} \eta = \max _{t,i,j} \frac{\Vert \mathbf {m_{i}^{t}} - \mathbf {m_{j}^{t}}\Vert - d_{ij}}{d_{ij}}. \end{aligned}$$
(14)
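
As a small illustration, Eq. 14 as printed can be evaluated with the Python sketch below. The data layout is an assumption for this example: each frame maps marker ids to 3D positions, `edges` lists the marker pairs (i, j) that share a limb, and `rest_lengths` holds the reference distance \(d_{ij}\) of each pair.

```python
import math

def noise_level(frames, edges, rest_lengths):
    """Eq. 14: largest relative deviation of an edge length from its reference."""
    worst = -math.inf
    for markers in frames:                      # loop over frames t
        for (i, j) in edges:                    # loop over marker pairs (i, j)
            d = rest_lengths[(i, j)]
            length = math.dist(markers[i], markers[j])
            worst = max(worst, (length - d) / d)
    return worst
```

A sequence in which one edge stretches from 1.0 to 1.2 times its reference length would fall into the 20% noise class under this definition.
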

Fig. 4 Comparison results with down-sampling: percent of incorrectly labeled marker trajectory (\(1-\zeta \)) versus different frame rate. The dashed lines represent results of the two methods on all MoCap data. The bars represent results on MoCap data with different embedded noise levels (Eq. 14, all: noise free, 0.2: 20% noise and 0.4: 40% noise)

Fig. 5 Accuracy comparison results on motion capture data of two interacting subjects

Fig. 6 Comparison results on a trampoline motion sequence

Table 3 Efficiency (fps) comparison on single subject

The results are shown in Fig. 3 and Table 2. At a high capture rate (120 fps), the 3D position of each marker does not change much between frames, so the result of UGM is as good as ours. However, due to limited computing power, lower capture rates such as 60 or 30 fps are commonly used in practical applications. We therefore compared our method against UGM at capture rates of 60, 45, 30 and 25 fps and obtained higher accuracy, as shown in Fig. 4. The efficiency results are shown in Table 3.

Fig. 7 Accuracy comparison results on a playing tennis sequence

We also test our approach on motion capture data of multiple interacting characters. By adding markers to the model graph, our method extends naturally to multiple subjects. The higher accuracy of our method compared with the alternative methods is shown in Fig. 5, and the efficiency results are shown in Table 4.

5.2 Application: online human motion reconstruction

Based on our online marker labeling and pose reconstruction algorithm, we propose an online motion reconstruction system. The motion capture system we use is a Vicon T-series system with 12 cameras. Our system takes unlabeled 3D marker positions as input and produces reconstructed poses online in real time. We compare the resulting animation of our method against Vicon and the alternative labeling methods YLD, UGM, LH and CP. The comparison results are best viewed in the supplementary video, although we show several examples in Figs. 6 and 7.

5.3 Discussion

For noisy motion capture data captured at 120 fps, the displacement of a marker between successive frames is very small, and the labeling accuracy of our method is clearly better than that of the alternative methods CP, LH and YLD and almost equal to that of UGM. This is because, for CP, LH and YLD, the rigidity of the edges can no longer be maintained as the noise level in the motion capture data increases. The accuracy of UGM, however, degrades as the capture rate decreases because it only considers the smoothness of the marker’s trajectory. This indicates that the integrated use of the soft graph matching model and the missing marker estimation scheme helps to resume identification after most of the tracking has been lost.

Ghost markers. The original motion capture data contain noise markers. We randomly generate more noise markers as ghost markers. We first specify the number (\(\alpha \)) of ghost markers and then randomly generate the positions of the ghost markers as well as their appearing times. The ghost markers are generated in two different ways (Fig. 8). In the first way, ghost markers are generated directly from the original noise marker positions. In the second way, ghost markers are generated from extra noise markers, which are sampled from the original noise marker positions with added Gaussian noise N(0, 2) (cm). We test our marker labeling method on 500 randomly generated motions in both ways. We find that even when \(\alpha =|M|\), the total number of wrongly labeled markers is still less than 10. These results demonstrate the capability of rejecting a large number of ghost markers.
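
The two ways of generating ghost markers can be sketched in Python as follows. The sketch is illustrative only. The function name, the (frame, position) output format and the fixed seed are assumptions, and because the text does not say whether the 2 in N(0, 2) is a variance or a standard deviation, the standard deviation is left as a parameter `sigma_cm`.

```python
import random

def generate_ghosts(noise_positions, alpha, n_frames, sigma_cm=None, seed=0):
    """Return a list of (appearing_frame, position) pairs for alpha ghost markers.

    sigma_cm=None  -- first way: reuse an original noise marker position as is
    sigma_cm=value -- second way: perturb it with zero-mean Gaussian noise (cm)
    """
    rng = random.Random(seed)                 # seeded for reproducible tests
    ghosts = []
    for _ in range(alpha):
        base = rng.choice(noise_positions)
        if sigma_cm is None:
            pos = base
        else:
            pos = tuple(c + rng.gauss(0.0, sigma_cm) for c in base)
        ghosts.append((rng.randrange(n_frames), pos))
    return ghosts
```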

Facial marker labeling. Unlike the human body, the human face has no limbs. As a result, local rigid constraints are invalid for facial motions, as the relative distances between markers may change considerably with the movement of facial muscles and skin. We therefore only use the motion smoothness constraint to estimate the matching cost of different marker correspondences (the 2nd term in Eq. 4). To correctly label all markers at the 1st frame, similar to the full-body case, we instruct the subject to start facial motions from a “normal” expression. Figure 9 demonstrates the accuracy of our method for facial marker labeling.

Table 4 Efficiency (fps) comparison on double subjects

6 Conclusion

In this paper, we present a new online marker labeling method for optical motion capture, which can be used for building real-time live applications. Experimental results demonstrate that the marker labeling accuracy of our method exceeds that of state-of-the-art marker labeling methods, especially in the cases of missing/ghost markers and low capture rates. It benefits from the integrated use of the proposed soft graph matching model and the marker estimation scheme, which simultaneously considers the local geometrical structure and the full pose. Although the marker labeling efficiency of our method is not the best (it ranks 4th of 5 methods) due to the use of the pose reconstruction process, it is still sufficient for real-time live applications.

6.1 Limitation and future work

The performance of our method degrades as the motion capture rate decreases. The main reason is that the current setting of the empirical weights is not optimal. In fact, when the MoCap data are down-sampled or the markers were previously occluded, the weight of the point correspondence cost should be decreased, and when a marker strongly violates the rigidity constraint (i.e., it is attached to soft tissue), the weight of the edge correspondence cost should be decreased. In the future, we plan to devise an automatic or experimental scheme for finding optimal weights suited to various kinds of motion. We also plan to explore a robust method that automatically detects labeling failures and reinitializes the labeling process.

Fig. 8 Labeling result on a walking sequence with 41 ghost markers. The two different ways of generating the position of ghost markers are represented in the left and right two images, respectively

Fig. 9 Top row: input unlabeled markers (white dots) of facial motion capture data. Bottom row: marker labeling (colored dots) results (the green marker links are only used for viewing convenience, not the indication of rigid constraint)

Because it aims to identify markers for live real-time applications, our method is easily extended to multi-actor interaction motions. Unfortunately, the estimated marker positions are not always accurate, especially when all markers on an arm or a leg are occluded for a long period of time. Inaccurate estimation of missing markers may degrade the labeling algorithm, especially when a missing marker re-appears. For multiple interacting characters, the quality of the reconstructed motion cannot be guaranteed when the interaction becomes more intensive (for example, when two people hold each other while rolling on the ground). We plan to further study data-driven approaches. First, when calculating Eq. 3, we implicitly assume the correspondence of the neighbor, but the labeling result may conflict with this assumption; this could be improved by using example data. Second, we would like to explore how to construct a reasonable statistical model from an example database so that better predictions can be derived for occluded markers. Finally, reconstructing the movement of multiple intensively interacting characters is another problem worth studying.