3D hypothesis clustering for cross-view matching in multi-person motion capture

We present a multiview method for markerless motion capture of multiple people. The main challenge in this problem is to determine cross-view correspondences for the 2D joints in the presence of noise. We propose a 3D hypothesis clustering technique to solve this problem. The core idea is to transform joint matching in 2D space into a clustering problem in a 3D hypothesis space. In this way, evidence from photometric appearance, multiview geometry, and bone length can be integrated to solve the clustering problem efficiently and robustly. Each cluster encodes a set of matched 2D joints for the same person across different views, from which the 3D joints can be effectively inferred. We then assemble the inferred 3D joints to form full-body skeletons for all persons in a bottom–up way. Our experiments demonstrate the robustness of our approach even in challenging cases with heavy occlusion, closely interacting people, and few cameras. We have evaluated our method on many datasets, and our results show that it has significantly lower estimation errors than many state-of-the-art methods.


Introduction
Multi-person motion capture estimates the articulated joint positions and/or angles for a group of people from video. It is an important yet challenging task with many applications, such as human-computer interaction, action recognition, emotion analysis, and human performance analysis. The latest work shows that markerless motion capture is feasible for a single person in weakly controlled environments, but is very difficult for a group of people in uncontrolled environments, due to the increased complexity in occlusion, appearance, motion, shape, and scale.
In this paper, we focus on markerless motion capture for multiple people with a multiview setup. Past approaches typically solve this problem in two stages. The first stage detects 2D body keypoints or poses in each view for all persons, and the second stage matches them across views to reconstruct 3D poses. As deep-learning based keypoint and pose detection techniques have greatly advanced [1][2][3][4][5][6], the remaining challenge is to resolve the correspondences between detected keypoints or poses across different views and different persons. Most previous methods employ a 3D pictorial structure (3DPS) model to implicitly solve the correspondence problem by reasoning about all hypotheses in 3D that are geometrically compatible with the detected 2D information [7][8][9][10][11]. However, 3DPS-based approaches are computationally expensive due to the huge state space. In addition, they are not robust, especially when there are few cameras, as they link the detected 2D joints based only on multiview geometry, while appearance cues are ignored.
This paper presents a 3D hypothesis clustering technique to efficiently and robustly determine the cross-view correspondences between the detected joints. The proposed technique transforms the correspondence problem from 2D space to 3D, and solves it by a 3D hypothesis clustering algorithm incorporating appearance evidence, multiview geometry, and bone length information. Each resulting cluster is a set of 3D points, which corresponds to a set of matched 2D detections of the same person in different views. The 3D joints can then be inferred from the matched 2D joints; full skeletons for all persons are formed by assembling the inferred 3D joints.
We have tested our algorithm on a number of benchmark datasets, including Panoptic [12], Shelf, and Campus [8]. The high accuracy of our method is shown by a quantitative comparison to many state-of-the-art methods. Some results of our method are shown in Fig. 1, while the Electronic Supplementary Material (ESM) provides further results.
The main contributions of this paper are:

Related work
There have been many approaches to 3D pose estimation for multiple people with a multiview setup. Most previous works are based on 3DPS models, in which nodes represent 3D locations of body joints and edges encode pairwise relations between them [7][8][9][10][11]. The state space for each joint is often a 3D voxel grid obtained by discretizing a 3D space.
The node-likelihood score of each voxel is computed by projecting the center of the voxel into all views and averaging the scores at the projected positions. Pairwise potentials between joints are given by skeletal constraints [8,9] or body parts detected in 2D views [10,11]. 3D poses of multiple people are jointly inferred by maximum a posteriori estimation. In contrast to our approach, most of these approaches require the number of people to be known. Additionally, all of these approaches consider all body joints for all people simultaneously, resulting in a huge state space and a large amount of computation for inference. Moreover, these approaches are usually sensitive to the number of cameras and to noise in the 2D detections, as they implicitly match the 2D detections across views using only multiview geometry.
A recent approach first explicitly matches the detected 2D poses and then reconstructs 3D poses from the 2D poses belonging to each person [13]. It uses OpenPose of Cao et al. [4] as the single-view pose detector, and solves the cross-view 2D pose matching problem based on epipolar geometry. A limitation of this method is that simple epipolar geometry verification can be error-prone, as the detected 2D poses are often incomplete and inaccurate due to occlusion and truncation. Furthermore, multi-person 2D pose parsing is performed in each view independently, which is time-consuming and ignores complementary information from other views. Also, as each possible pair of poses from two views is matched separately, inconsistent correspondences may result, i.e., two matched poses from two views may be associated with different people in another view. Such inconsistencies lead to serious errors in 3D pose reconstruction. Dong et al. [14] address these challenges with a convex optimization based multi-way matching algorithm, which determines correspondences for all views at once. In their work, a human detector [15] and a state-of-the-art single-person pose estimator [16] are applied to each view to estimate multi-person 2D poses. In the matching stage, a person re-identification network [17] is integrated to exploit appearance information. This method achieves excellent performance, but is very time-consuming as it integrates so many network modules. Furthermore, pose-level 2D detection usually cannot provide reliable estimation in scenes with many closely interacting people [18], making it difficult to find correct correspondences between 2D poses.
In contrast, our approach solves the correspondence problem at the joint level for all views at once, by combining multiview appearance cues, geometric constraints, and bone length constraints. The advantages are as follows. Firstly, only joint-level 2D detection is needed, which is robust to occlusion and truncation. Secondly, pose parsing is performed in 3D space only once rather than many times in 2D, which is more efficient and more reliable as complementary information from all views is exploited. Thirdly, it naturally avoids inconsistent correspondences, as all views are considered simultaneously.

Approach
Our approach to multi-person 3D skeleton reconstruction is overviewed in Fig. 2. It takes as input a set of images from multiple calibrated and synchronized video cameras (Fig. 2(a)), and outputs 3D skeletons for all people in the scene (Fig. 2(h)). The skeleton for each person comprises 14 joints and 13 limbs, as shown in Fig. 3 (the dashed limb connections do not have constant bone length).
In the following sections, we assume a setup with C calibrated cameras, providing input images {I_1, ..., I_C} with corresponding projection matrices {P_1, ..., P_C}.

2D joint candidate detection
We detect 2D joint candidates in each image using the network proposed by Cao et al. [4]. It takes an RGB image as input, and outputs a set of confidence maps for each joint (Fig. 2(b)) and a set of part affinity fields (PAFs) for each limb (Fig. 2(c)).
Let H_{i,j} denote the confidence map for joint j in image I_i. Given H_{i,j}, the candidates of joint j for all people can be obtained by performing non-maximum suppression (NMS), yielding a set of 2D points as shown in Fig. 2(d), denoted φ_{i,j} = {x^k_{i,j} | k ∈ {1, ..., n_{i,j}}}. These detections can be noisy, due to occlusion, motion blur, ambiguity due to symmetry, distorted poses, etc. Extracting 3D joint positions directly from these detections is unreliable.
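As an illustration, the NMS step can be sketched in NumPy as follows. This is a minimal sketch, not the implementation of Ref. [4]: the 3×3 peak window, the score threshold, and the function name `nms_candidates` are our assumptions.

```python
import numpy as np

def nms_candidates(heatmap, threshold=0.1):
    """Extract 2D joint candidates from one confidence map: keep pixels
    that exceed `threshold` and are maxima of their 3x3 neighbourhood.
    Returns a list of (x, y, score) triples."""
    H, W = heatmap.shape
    padded = np.pad(heatmap, 1, mode="constant")
    center = padded[1:-1, 1:-1]
    # A pixel survives if it is >= all 8 of its neighbours.
    is_peak = np.ones_like(center, dtype=bool)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            is_peak &= center >= padded[1 + dy:H + 1 + dy, 1 + dx:W + 1 + dx]
    ys, xs = np.nonzero(is_peak & (center > threshold))
    return [(int(x), int(y), float(heatmap[y, x])) for y, x in zip(ys, xs)]
```

Applied per joint and per view, this yields the candidate sets φ_{i,j}.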
The PAFs are used to measure the confidence that two joint candidates form a limb of one person. Consider two 2D joint candidates in image I_i: x_{i,a} and x_{i,b}, where joint indices a and b correspond to a limb in Fig. 3. The connectivity score that they come from the same person, s(x_{i,a}, x_{i,b}), is computed by integrating the dot product of the corresponding PAF and the unit vector from x_{i,a} to x_{i,b} over the line segment between them (see Eq. (10) in Ref. [4] for details). Note that this score may be unreliable in a single view due to partial visibility of people or close proximity of similar-looking people. In Section 3.2.1, we introduce a connectivity score in 3D space for reliability (Eq. (1)).
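A minimal sketch of this 2D connectivity score, approximating the line integral of Ref. [4] by uniform sampling along the segment (the sample count and function name are our choices):

```python
import numpy as np

def paf_score(x_a, x_b, paf, n_samples=10):
    """Approximate PAF connectivity score: average the dot product of the
    2D field `paf` (shape H x W x 2) with the unit vector from x_a to x_b,
    sampled at n_samples points along the segment between them."""
    x_a, x_b = np.asarray(x_a, float), np.asarray(x_b, float)
    d = x_b - x_a
    norm = np.linalg.norm(d)
    if norm < 1e-8:
        return 0.0
    u = d / norm                                # unit limb direction
    total = 0.0
    for t in np.linspace(0.0, 1.0, n_samples):
        px, py = np.round(x_a + t * d).astype(int)
        total += float(paf[py, px] @ u)         # field-direction agreement
    return total / n_samples
```

A field aligned with the limb yields a score near 1; a perpendicular or missing field yields a score near 0.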
The next two steps, 2D joint candidate matching across views (Section 3.2) and 3D joint candidate reconstruction (Section 3.3), are performed sequentially on the joints in the order shown on the skeleton model in Fig. 3. Note that the order used may affect the cross-view joint matching procedure. It is possible that optimizing the order might improve the result, but the fixed order used in our current implementation works well.

Fig. 2 Overview. The input is a set of synchronized images from multiple views (a); the final output comprises 3D poses in global world space (h). Each individual is followed over time, marked with a fixed color as in (g) and (h).

2D joint candidate matching between views
2D joint candidate matching aims to find joint correspondences for each person across all views, which can be formulated as a multi-way matching problem. Classical approaches to this problem include spanning tree optimization [19], sampling inconsistent cycles [20], low-rank matrix recovery [21], etc. Most of them define feature similarity from 2D images and solve the problem with a complex optimization algorithm to guarantee cycle consistency. In our setup, all views are calibrated and synchronized, so it is possible and valuable to exploit 3D cues for reliability and efficiency. Thus, we propose a 3D hypothesis clustering technique to match noisy 2D candidates across views, combining multiview geometric constraints, appearance cues, and bone length constraints. In this way, cycle consistency is naturally guaranteed.

3D hypothesis space construction
Consider joint j. Given the 2D candidates detected in each view, we triangulate each possible candidate pair from every pair of views, yielding a set of 3D points which forms a 3D hypothesis space. Let Λ_j be the hypothesis space for joint j, consisting of a number of 3D points; see Fig. 4(a). Many wrong hypotheses exist, stemming from the triangulation of 2D candidates which are wrongly detected or which belong to different individuals, as their identities are not known. A basic way to detect outliers is to check whether the reprojection error exceeds a fixed threshold τ_1 (τ_1 = 8 pixels in our experiments). However, some wrong hypotheses may pass this test if the two candidate locations happen to satisfy the epipolar geometry constraint, as shown in Fig. 4(b). To detect such wrong hypotheses, we impose two further constraints. The first constraint is PAF based. If there exists a limb p → j between joint j and its parent joint p, then a 3D point X_j ∈ Λ_j must have a high connectivity score with one of the 3D parent candidates X_p ∈ Φ_p. Specifically, we define the connectivity score in 3D space by combining multiple views. Consider two 3D candidates X_a and X_b, where joint indices a and b correspond to a limb in Fig. 3. The connectivity score between X_a and X_b, S(X_a, X_b), is computed by averaging the 2D connectivity scores of their projected points over all views:

    S(X_a, X_b) = (1/C) Σ_{i=1}^{C} s(P_i X_a, P_i X_b)    (1)

Then the constraint is defined as

    max_{X_p ∈ Φ_p} S(X_p, X_j) > τ_2

where τ_2 = 0.3 in our experiments.
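The hypothesis space construction with the reprojection error test can be sketched as follows, using simple DLT triangulation; the helper names (`triangulate`, `build_hypotheses`) and the two-view error check are our illustrative assumptions, not the exact implementation.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two views with
    3x4 projection matrices P1, P2 and pixel observations x1, x2."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                      # null vector = homogeneous 3D point
    return X[:3] / X[3]

def reproj_error(P, X, x):
    """Reprojection error (pixels) of 3D point X in the view with matrix P."""
    p = P @ np.append(X, 1.0)
    return float(np.linalg.norm(p[:2] / p[2] - np.asarray(x, float)))

def build_hypotheses(cands, Ps, tau1=8.0):
    """Hypothesis space for one joint: triangulate every candidate pair
    from every view pair; keep points whose reprojection error in both
    source views is below tau1.  cands[i] lists the 2D candidates of view i."""
    hyps = []
    for i in range(len(Ps)):
        for j in range(i + 1, len(Ps)):
            for xi in cands[i]:
                for xj in cands[j]:
                    X = triangulate(Ps[i], Ps[j],
                                    np.asarray(xi, float), np.asarray(xj, float))
                    if max(reproj_error(Ps[i], X, xi),
                           reproj_error(Ps[j], X, xj)) < tau1:
                        hyps.append(X)
    return hyps
```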
The second constraint is bone length based. If p → j is a length-constant limb, then the distance between X_p and X_j must approximately equal the corresponding bone length of one person. It is defined as

    min_{l ∈ L_{p,j}} | ||X_p − X_j|| − l | / l < τ_3

where L_{p,j} is the set of lengths of limb p → j over all persons, which are assumed to be known as priors for now; Section 3.6 presents an automatic method to quickly compute them. τ_3 = 0.1 in our experiments.
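The bone length test itself is compact; a hedged sketch (the function name and the per-person length list are illustrative):

```python
import numpy as np

def bone_length_ok(X_p, X_j, lengths, tau3=0.1):
    """Bone-length outlier test: keep the candidate limb (X_p, X_j) if its
    3D length is within a relative tolerance tau3 of at least one known
    bone length in `lengths` (one entry per person in the scene)."""
    d = np.linalg.norm(np.asarray(X_p, float) - np.asarray(X_j, float))
    return any(abs(d - L) / L < tau3 for L in lengths)
```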
Combining these two constraints with the reprojection error constraint and removing the outliers yields the 3D hypothesis space shown in Fig. 4(c).

3D hypothesis space clustering
The hypothesis space for each joint is shared between all persons (Fig. 4(c)). 3D hypothesis space clustering aims to separate the hypothesis space into one component per person. We adopt DBSCAN (density-based spatial clustering of applications with noise) [22]; other clustering algorithms which estimate the number of clusters could also be used. Figure 4(d) shows the clustered 3D hypothesis space. The 3D points in a given cluster correspond to a single person. The corresponding 2D joint candidates contributing to the 3D points in a given cluster are naturally matched across views, yielding a single 3D joint candidate.
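For concreteness, a minimal pure-NumPy DBSCAN over the 3D hypothesis points; this is a simplified sketch of the algorithm of Ref. [22] (library implementations such as scikit-learn's could be used instead), with our own function name and parameter choices:

```python
import numpy as np

def dbscan(points, eps, min_samples):
    """Minimal DBSCAN: returns one label per point (-1 = noise).  Points
    with at least `min_samples` neighbours within radius `eps` are core
    points; clusters grow by expanding from core points through their
    neighbourhoods."""
    pts = np.asarray(points, float)
    n = len(pts)
    dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
    neighbors = [np.nonzero(dist[i] <= eps)[0] for i in range(n)]
    core = [len(nb) >= min_samples for nb in neighbors]
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        labels[i] = cluster          # start a new cluster at core point i
        queue = list(neighbors[i])
        while queue:                 # breadth-first cluster expansion
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster
                if core[q]:
                    queue.extend(neighbors[q])
        cluster += 1
    return labels
```

Each resulting cluster collects the 3D hypotheses, and hence the matched 2D detections, of one person; label -1 marks residual outliers.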
One challenge in cross-view matching is that the positions of some joints from different people can be very close (e.g., the right wrists when two people clap right hands with each other, as in the first example in Fig. 1). In such cases the clustering algorithm may be unable to distinguish them and so groups them into a single cluster, so one 3D candidate is missing. To solve this problem, if the center of a cluster has a plausible distance from more than one parent 3D candidate according to the corresponding bone length set, we split the cluster by re-grouping its 3D points according to their different optimal parent joints with reasonable bone length.

3D joint candidate reconstruction
Consider joint j again. Once a set of 2D joint candidates {x_{i_1,j}, ..., x_{i_n,j}} has been matched across views {i_1, ..., i_n}, we can determine a 3D point X_j with a least-squares optimization procedure which minimizes the sum of reprojection errors in those views.
It is well known that least squares is very sensitive to outliers, especially when there are very few input values, as in our case with few cameras. Thus, we use a weighted cost function based on the confidence of the 2D candidates:

    X_j = argmin_X Σ_{k=1}^{n} w_k ||π(P_{i_k} X) − x_{i_k,j}||²

where π(·) denotes perspective projection to pixel coordinates and w_k is the detection confidence of candidate x_{i_k,j}. In this way, we give more importance to reprojection errors for good 2D candidates. This cost function is optimized using the Levenberg-Marquardt least-squares algorithm [23,24]. The reconstructed 3D point X_j is a candidate for the jth anatomical joint of one person, which we refer to as a 3D joint candidate for joint j. Figure 2(e) shows the reconstructed 3D joint candidates for all joints, for all persons.
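The weighted refinement can be sketched with SciPy's Levenberg-Marquardt solver; the function names and the choice of initialization (e.g., a DLT estimate) are our assumptions, not the exact implementation.

```python
import numpy as np
from scipy.optimize import least_squares

def project(P, X):
    """Project 3D point X into the view with 3x4 matrix P (pixel coords)."""
    p = P @ np.append(X, 1.0)
    return p[:2] / p[2]

def refine_joint(Ps, xs, ws, X0):
    """Confidence-weighted triangulation: minimise the sum of weighted
    reprojection errors with Levenberg-Marquardt.  Ps are the projection
    matrices of the matched views, xs the matched 2D candidates, ws their
    detection confidences, and X0 an initial 3D estimate."""
    def residuals(X):
        # sqrt weights so the squared residual is w * ||error||^2
        return np.concatenate(
            [np.sqrt(w) * (project(P, X) - np.asarray(x, float))
             for P, x, w in zip(Ps, xs, ws)])
    return least_squares(residuals, X0, method="lm").x
```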

3D pose parsing
Given the reconstructed 3D joint candidates (Fig. 2(e)), pose parsing aims to generate a full-body pose for each person by assembling the corresponding joint candidates. We extend the method proposed by Cao et al. [4] for 2D multi-person pose estimation: we perform pose parsing in 3D space instead of 2D. Accordingly, the connectivity score between possibly connected candidates is computed in 3D space as in Eq. (1). Moreover, each limb connection candidate must have the correct length according to the bone length set. As more evidence is integrated from multiple views and bone lengths, our extended method is robust to occlusion with significantly overlapping people, etc. Figure 2(f) shows the parsed 3D poses for all persons.

Pose tracking
Since the above steps reconstruct multiple 3D poses for each time step independently, we use a pose tracking method to obtain pose trajectories over time. In our setup, the 3D pose of each person is reliably reconstructed, so it is used for robust tracking.
Given the pose predictions in 3D space, we link them in time to obtain trajectories, which can be seen as a data association problem over these predictions. We compute tracks by simplifying this problem to bipartite matching between each pair of adjacent frames (Figs. 2(f) and 2(g)). We initialize the tracks in the first frame and propagate labels forward using a greedy method, one frame at a time. The similarity weight for two poses in adjacent frames is measured by the average distance between torso joints (head, neck, left/right shoulder, left/right hip) in 3D space. Any pose that is not matched to an existing track instantiates a new track. The temporally consistent poses are shown in Figs. 2(g) and 2(h), where poses for the same person are marked in the same color.
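The greedy frame-to-frame association can be sketched as follows; the torso joint indices, the distance gate, and the function names are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

TORSO = [0, 1, 2, 3, 4, 5]   # assumed indices: head, neck, shoulders, hips

def torso_distance(pose_a, pose_b):
    """Mean 3D distance over torso joints between two (14, 3) poses."""
    return float(np.linalg.norm(pose_a[TORSO] - pose_b[TORSO], axis=1).mean())

def propagate_labels(prev_poses, prev_ids, cur_poses, next_id, max_dist=0.5):
    """Greedy bipartite association between adjacent frames: repeatedly
    link the closest unused (previous, current) pair whose torso distance
    is below `max_dist`; any unmatched current pose starts a new track."""
    pairs = sorted(
        ((torso_distance(p, c), i, j)
         for i, p in enumerate(prev_poses)
         for j, c in enumerate(cur_poses)),
        key=lambda t: t[0])
    cur_ids = [None] * len(cur_poses)
    used_prev = set()
    for d, i, j in pairs:
        if d > max_dist or i in used_prev or cur_ids[j] is not None:
            continue
        cur_ids[j] = prev_ids[i]
        used_prev.add(i)
    for j in range(len(cur_ids)):
        if cur_ids[j] is None:       # a new person entering the scene
            cur_ids[j] = next_id
            next_id += 1
    return cur_ids, next_id
```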

Bone length estimation
To simplify system setup, we use an automatic technique to estimate bone lengths of all persons to be captured. As shown in Fig. 3, the skeleton has 11 length-constant bones. We estimate their lengths on-the-fly. First, we construct initial 3D poses for all persons in a similar way to that described above but without considering bone length information, and track them for a few seconds, producing hundreds of initial 3D poses for each subject. Then, for each subject, we compute bone lengths from each pose, yielding a number of length approximations for each bone. Though the distribution of the length approximations can be noisy, most approximations are near the ground truth. So, for each bone, we use the median length as the final estimate. This automatic method enables our motion capture system to conveniently adapt to different subjects without any subject-specific priors.
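The median-based estimate can be sketched directly; the limb index pairs below are placeholders for the 11 length-constant bones of Fig. 3, and the function name is our choice.

```python
import numpy as np

LIMBS = [(0, 1), (1, 2)]   # placeholder (parent, child) joint index pairs

def estimate_bone_lengths(poses, limbs=LIMBS):
    """Per-bone length estimate for one subject from its tracked initial
    poses (array of shape (T, J, 3)): compute the length in every pose and
    take the median, which is robust to noisy outlier reconstructions."""
    poses = np.asarray(poses, float)
    lengths = {}
    for p, j in limbs:
        per_pose = np.linalg.norm(poses[:, p] - poses[:, j], axis=1)
        lengths[(p, j)] = float(np.median(per_pose))
    return lengths
```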

Experimental results
We have evaluated our proposed method on several benchmark datasets: Panoptic [12], Shelf, and Campus [8]. In the experiments, no ground truth data from the respective datasets was used for training. We rely solely on a generic 2D pose CNN (trained on the MPII dataset [25]) and knowledge of the geometry of the calibrated camera setup. Figure 5 shows some results for these datasets, demonstrating that our method can produce high-quality 3D skeletons in general scenarios with many interacting people (7 people in the 7th example), large variability in human physique (see the toddler in the 3rd example), severe occlusion (see the 4th example), and low-resolution images (360 × 288) using only 3 cameras (the 1st example). The evaluation metric is the percentage of correct parts (PCP) with α = 0.5.

Evaluation on benchmark datasets
We compare our method to the state-of-the-art methods of Refs. [9,10,14]. Specifically, Belagiannis et al. [9] used a factor graph optimized with belief propagation to perform inference on 2D poses detected by DeepPose [1]. Ershadi-Nasab et al. [10] used DeeperCut [3] as the 2D joint detector, defined a fully connected conditional random field, and then used loopy belief propagation for approximate inference. Dong et al. [14] is a very recent method, which combines a human detector [15] and a single-person pose estimator [16] to obtain multi-person 2D poses in all views, and then uses a convex optimization based multi-way matching algorithm to match detected poses across views, from which the 3D pose of each person is inferred. Table 1 quantitatively reports the comparison results.
Both Refs. [9] and [10] are 3DPS model-based approaches. Belagiannis et al. [9] did not use recent heatmap-based keypoint detectors, which perform better than DeepPose [1] (a direct coordinate-regression detector). For a fair comparison, we therefore also report the result in Ref. [14], which re-implements the approach of Belagiannis et al. [9] using the same state-of-the-art 2D keypoint detector [4] as our method, indicated as Belagiannis et al.*.
The results show that our method outperforms both 3DPS model-based methods [9,10] by a large margin on the Campus dataset (3.61%) and achieves a substantial improvement on the Shelf dataset (1.47%). In particular, our approach significantly improves results for actor 3 in the Campus dataset, who suffers from heavy occlusion with only 3 cameras. The main reason is that the 3DPS model-based methods implicitly match 2D detections using only multiview geometry and ignore appearance cues. When the number of cameras is small or heavy occlusion exists in most views, multiview geometry consistency alone is sometimes insufficient to distinguish correct from incorrect correspondences, leading to incorrect 3D poses. Our approach explicitly leverages appearance cues (3D PAF scores), multiview geometric constraints, and bone length constraints to find cross-view correspondences, yielding much more robust results. Moreover, our method avoids a complex inference procedure in a huge state space, which can be very time-consuming.
Our approach performs better than or on par with the method of Ref. [14], in which, as in our method, 3D poses are reconstructed by direct triangulation. The advantages of our approach are as follows. Firstly, it is very compact and efficient, as it avoids the many network modules in Ref. [14] (a human detector [15], a single-person pose detector [16], and a person re-identification CNN [17]): only a 2D pose CNN [4] is needed. Secondly, it is more robust to heavy occlusion and closely interacting people. In such cases, 2D pose detection is usually unreliable, making it difficult to find correct correspondences between 2D poses. Our approach largely alleviates this problem, as detection of 2D joints is usually feasible [18], and full skeletal poses are parsed in 3D space by integrating complementary information from all views. Thirdly, instead of requiring a complex convex optimization algorithm to guarantee cycle consistency in cross-view matching, our approach uses a novel 3D hypothesis clustering algorithm in which cycle consistency is naturally guaranteed.
Panoptic [12] is a large-scale indoor dataset, including 480 VGA video streams and 31 HD video streams of multiple people (more than 7) engaged in social activities. We selected data from 5 activities (Toddler, Haggling, Band, Pizza, and Bang) for evaluation. For each activity, we selected 1 or 2 sub-sequences of 40 seconds (1200 frames) from HD cameras. To assess errors on this dataset, we used the mean per joint position error (MPJPE) as the evaluation metric.
Table 2 reports the error for each body joint as well

Ablation study
This section analyzes the impact of (i) the three constraints used in outlier detection (Section 3.2.1) and (ii) the number of views used for 3D pose reconstruction, using the Panoptic dataset.

Outlier detection constraints
We use three constraints to detect 3D hypothesis outliers, based on reprojection error (RE), part affinity fields (PAF), and bone length (BL). We now test the impact of these constraints on performance, using the challenging sequence Pizza from the Panoptic dataset. It contains 6 people with frequent occlusion. The baseline detects outliers solely based on reprojection error.
To show how well bone length is preserved, we introduce a new error metric, mean bone length standard deviation (BoneStd). It measures the stability of bone length; smaller is better. We compute BoneStd as follows: for each bone of each tracked subject, we compute the standard deviation of its length over all tested frames, and then average over all bones and all subjects.
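The metric is a direct transcription of this definition; the array layout (one `(T, J, 3)` pose array per subject) is our convention.

```python
import numpy as np

def bone_std(tracked_poses, limbs):
    """Mean bone length standard deviation (BoneStd): for each bone of each
    tracked subject, compute the std of its length over all frames, then
    average over all bones and subjects.  tracked_poses[s] has shape
    (T, J, 3) for subject s; `limbs` lists (parent, child) joint pairs."""
    stds = []
    for poses in tracked_poses:
        poses = np.asarray(poses, float)
        for p, j in limbs:
            lengths = np.linalg.norm(poses[:, p] - poses[:, j], axis=1)
            stds.append(lengths.std())
    return float(np.mean(stds))
```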
Table 3 shows the evaluation results. Compared with results based only on reprojection error, the PAF-based and BL-based constraints each by itself improve estimation quality significantly. The best results are achieved when all three constraints are combined.

Analysis on varying number of views
To evaluate the impact of the number of views, we ran our method with varying numbers of views (6, 8, 12, 16) on several sub-sequences from the Panoptic dataset with differing numbers of people (2, 3, 5, 6, 7). The views were sampled in an approximately uniform way.
As Table 4 shows, if a small number of views is used, the results have higher errors. For a fixed number of views, the error depends on the complexity of the scene: the number of people involved, the frequency of occlusion, etc. If the scene is simpler (e.g., with 2 or 3 people), a smaller number of views (e.g., 6 cameras) suffices. Overall, errors cease to improve beyond around 12 views. As expected, more views benefit accurate motion capture, and help more when the scene contains more people.

Multi-person 3D pose dataset creation
We created a small-scale dataset with naturally interacting people performing daily activities both indoors and outdoors. It contains more than 20,000 frames captured from 8 camera views, covering daily activities and sports, such as walking, discussions, parties, playing musical instruments, etc. Several sub-sequences are included in the ESM.

Conclusions
This paper presents a markerless motion capture method for multiple people, which fuses cues from 2D joint detection, multiview geometry, and auto-estimated bone lengths. Our key difference from previous methods lies in a novel 3D hypothesis clustering technique to match 2D joint detections across views, which is robust to noise and to a limited number of cameras. Multi-person pose parsing is performed in 3D space instead of 2D, so that complementary information from multiple views can be exploited, improving both efficiency and reliability.
A major advantage of our method is that it is fully automatic and does not need subject-specific priors.The number of people in the scenes is automatically determined and may vary during capture.Our experiments demonstrate the robustness of our method to challenging issues such as significant occlusion, largely overlapping people, and closely interacting joints.

Fig. 1 Indoor and outdoor examples captured using our method.

The Shelf and Campus datasets [8] contain 2-4 individuals captured by multiple cameras, indoors and outdoors. Our evaluation on the Shelf dataset uses 300 frames, and on Campus, 220 frames, following previous works [9,10,14].

Fig. 5 Illustrative results from our method on the Campus, Shelf, and Panoptic datasets.

Table 4 Comparison of 3D error (MPJPE in cm) using varying numbers of views on different sub-sequences. The number of people in each sequence is given in brackets.