Recovering human pose has become increasingly important in sports, animation, human–computer interaction, video surveillance, action recognition and related fields. A 2D pose is inherently ambiguous, because many different 3D poses project to the same 2D pose. The potential of 3D human pose estimation in sports has therefore attracted academic interest. In figure skating, the jump is widely acknowledged as one of the most critical elements of a skater’s program. An excellent jump is attractive to the audience but places strict technical demands on the athlete. Analyzing jumps in 3D not only allows a figure skater’s performance to be evaluated objectively, but also enhances the audience’s viewing experience. The analysis of 3D jump pose therefore plays a significant role in understanding a figure skater’s behavior.

Why is multi-perspective 3D pose estimation interesting? First, driven by the success of deep learning, an increasing number of efforts in the community, both end-to-end methods [1,2,3] and two-step methods [4,5,6], focus on monocular 3D pose estimation [7, 8]. Many of these methods estimate a depth relative to a reference point or a root joint and then compute the final 3D pose from prior information, so the result is not expressed in real-world coordinates. A multi-perspective structure [9,10,11,12], by contrast, is accurate and efficient enough to provide annotations for monocular 3D pose estimation. Second, multi-perspective 3D pose estimation has become widespread for practical reasons. In practical applications, multiple perspectives avoid the blind spots of a single view, and in most cases they achieve higher accuracy than a single view.

Fig. 1

Abnormal pose and large scale venue bring in challenges

The goal of 3D human pose estimation in figure skating is to localize the key points of the body in 3D space. However, most previous 3D human pose estimation methods overlook the limitations imposed by the large venue and the small target in figure skating. Diverse variations in background, costume, abnormal pose, self-occlusion, athletic field and camera parameters make 3D jump pose estimation a challenging problem.

This work summarizes two main difficulties, as shown in Fig. 1. Figure skating combines athletic power with elegant artistry and contains abnormal poses that differ from daily motion. First, an abnormal pose means that several limbs are easily confused and are difficult to identify even for human eyes. Second, figure skating is always held in a large venue, and the athlete moves over a wide range. In some situations the target is far from the camera, so its image is unclear and difficult to detect.

In this paper, we develop a transformation system that generates a figure skater’s 3D pose from the corresponding multi-view 2D Part Confidence Maps [13], which indicate the probability that a joint appears in each area of the map. The system handles 3D pose estimation in a large venue through a binocular stereo reconstruction architecture for jump analysis in figure skating.

Related works

Multi-view 3D pose estimation

Iskakov et al. [14] presented two novel solutions for multi-view 3D human pose estimation based on new learnable triangulation methods that combine 3D information from multiple 2D views. Pavlakos et al. [15] presented an automatic way to gather 3D annotations for human pose estimation tasks, using a generic ConvNet for 2D pose estimation and recordings from a multi-view setup. Remelli et al. [16] proposed a new multi-view fusion technique for 3D pose estimation that reasons effectively across multi-view geometry while introducing negligible computational overhead with respect to monocular methods. Tome et al. [17] proposed a CNN-based approach for multi-camera markerless motion capture of the human body that uses 3D reasoning throughout a multistage pipeline. Liang et al. [18] proposed a scalable neural network framework to reconstruct the 3D mesh of a human body from multi-view images, focusing on body shape rather than on joints. Ohashi et al. [19] discussed 3D reconstruction of human motion from multi-camera images. Luo et al. [20] proposed a multi-task, multi-level neural network structure with physical constraints to estimate 3D human poses from a single RGB image in an end-to-end manner. This work has limitations on abnormal pose and large venue, which are important features of figure skating. After the Part Confidence Maps are computed from each camera image, the proposed spatial–temporal filter is applied to deliver human motion data with accuracy and smoothness for human motion analysis. The conventional work [21] preceding this paper developed a system to obtain the 3D jump pose of a figure skater. At its core, that method corrects inaccurate or even erroneous reconstruction results by combining spatial–temporal information and multiple perspectives during the 2D-to-3D pose transformation.

Sensor-based jump analysis in figure skating

Advancements in wearable technology have facilitated performance monitoring in a number of sports. Many sensor-based methods [22, 23] provide valuable analysis of figure skating.

An inertial sensor-based system (MISSIE) [24] sampled jumps while they were filmed with high-speed video. For the time values of the figure skating jump events (toe pick, release of glide leg, take-off and landing), the inertial data were analysed both manually and by a software algorithm, and the video data were analysed manually. The study reports that MISSIE can be used for figure skating jump analysis and feedback and is superior to traditional video analysis. Another study [25] developed a prototype jump monitor for figure skating, with the purpose of evaluating the feasibility of using an inertial measurement unit (IMU) to monitor jumping performance. It found that multi-revolution jumps can be identified accurately and rotation speeds quantified using a single waist-mounted IMU.

Manual-observation-based jump analysis in figure skating

Many figure skating analysts judge athletes’ performance based on the images or the videos of the competition [26, 27]. The successful execution of jumps can be determined through observing frame by frame.

Work [28] reviews the biomechanics of triple and quadruple figure skating jumps, focusing on information that has implications for strength and conditioning programs. By fully observing the motion of the figure skater, the authors conclude that, to complete the required revolutions in a jump, a skater must balance the average angular velocity with the time in the air. Study [29] tests an elite male junior skater who performed a series of jumps on the ice; the authors observed substantial differences in the movement technique and kinematic parameters of the pre-take-off phase.

Compared with previous studies of jump analysis in figure skating, this work has two attractive characteristics. First, the method is unobtrusive: figure skaters do not have to carry any measuring instruments. Second, this work uses computer vision techniques to extract and analyze the figure skater’s 3D pose instead of manually observing the original figure skating video. As a result, the proposed method can be applied to real competitions and can obtain the figure skater’s pose automatically and efficiently.

Framework of the reconstruction system

The system defines a complete jump as four consecutive stages: glide leg (S1), take off (S2), spin in the air (S3) and landing (S4), as shown in Fig. 2.

Fig. 2

Four stages of the flip jump in figure skating

Figure 4 illustrates the overall pipeline of the proposed approach. First, as shown in Fig. 4, the video sequences are captured from six perspectives. The part confidence map of each joint is obtained using OpenPose [13], a 2D pose estimation method. The part confidence map reflects how likely each joint is to appear in each 2D image area, as explained in OpenPose [13]. Figure 4 uses three views to explain how the discrete probability points are chosen. The higher the intensity of a confidence map, the more accurate the 2D recognition of the joint is considered to be. Discrete probability points are then selected according to the likelihood distribution and the temporal smoothness. Here, likelihood distribution means that the higher the intensity of the heatmap, the more points are selected. In the figure, the first confidence map has the lowest intensity, so one point (the yellow point) is chosen to represent the 2D position of the joint; the third confidence map has the highest intensity, so more points (the blue points) are chosen. At this stage, the 2D position of a joint in each perspective is represented by a set of discrete points.

Fig. 3

Figure skating venue size. Six cameras are placed every 60 degrees around the auditorium, and their fields of view cover the red area simultaneously

Fig. 4

Framework of the reconstruction system. The system takes six synchronized figure skating sequences of a single person as input and outputs the 3D human skeleton estimation results. The overall framework can be roughly divided into three steps, and each of the three proposals works in its own step

Second, two views are arbitrarily selected from the multiple views. The joint is labeled according to its confidence in the different viewpoints. In the step that reconstructs the spatial confidence point group, take the first (orange) picture as an example: one perspective uses three green dots to represent the 2D position of the joint, and the other perspective uses one yellow dot. In binocular reconstruction, each point in the first view is combined with each point in the other view to compute a 3D point. Therefore, the orange picture yields three 3D points (three by one), the green picture yields five 3D points (five by one), and the blue picture yields fifteen 3D points (five by three). The color setting is consistent with the 2D heatmap. The 3D confidence of the reconstructed points is then determined by the number of points. The orange picture contains 3 points, so its confidence is lower than that of the spatial confidence point group with 5 points in the green picture.
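The pairing rule in this step can be sketched in a few lines. This is only an illustration under assumptions: the function name `pair_candidates` and the sample pixel coordinates are hypothetical, and the binocular reconstruction that would turn each pair into a 3D point is not shown.

```python
from itertools import product

def pair_candidates(view_a, view_b):
    """Pair every 2D candidate in view_a with every candidate in view_b.

    Each pair would be passed to binocular reconstruction to yield one
    3D point, so the group size is len(view_a) * len(view_b).
    """
    return list(product(view_a, view_b))

# Hypothetical (u, v) pixel candidates for one joint.
three_pts = [(100, 200), (101, 200), (100, 201)]
one_pt = [(640, 360)]
five_pts = [(300 + i, 400) for i in range(5)]

orange = pair_candidates(three_pts, one_pt)   # three by one
green = pair_candidates(five_pts, one_pt)     # five by one
blue = pair_candidates(five_pts, three_pts)   # five by three
print(len(orange), len(green), len(blue))
```

The group sizes follow directly from the Cartesian product, which is why a view with more selected points contributes more 3D candidates.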

Third, the multiple spatial confidence point groups of a given joint are merged into one large spatial point group. The figure skater’s 3D pose is then estimated by analyzing the spatial confidence point groups and collecting constraint statistics based on the prior conditions. Since many 2D pose estimation approaches produce probability regions of body parts, this work can use any other 2D pose detector that provides joint confidence maps as an intermediate result.

Fig. 5

Basic flow of discrete probability point selection. The colored arrows, from left to right and from blue to red, represent a gradual decrease in confidence. More points are extracted from a higher-intensity part confidence map and fewer from a lower-intensity one. The lower heatmap has two part confidence peaks. By referring to the temporal information, the unique position of the right elbow can be decided. Temporal smoothing helps to find the single confidence map in each view that represents a given joint in the case of ambiguity

Fig. 6

Likelihood distribution-based discrete probability points selection. The higher the intensity of the confidence map, the higher the probability that the recognition result is positive. The lower the intensity, the higher the probability that the recognition result is negative. From left to right, the intensity of the confidence map gradually decreases; more points are chosen for higher confidence and fewer points for lower confidence

Likelihood distribution and temporal smoothness-based discrete probability points selection

In the example shown in Fig. 5, the purpose is to select a certain number of points in each view, based on probability, to represent the figure skater’s right elbow. In the top image, the confidence map of the right elbow has the strongest intensity. In the middle one, the confidence map is slightly weaker, and in the lower one, the right elbow is misrecognized: noise produces two part confidence maps for the right elbow. For each perspective, no matter how many part confidence maps represent the right elbow, discrete points are extracted from them to represent it. The discrete probability points are selected based on the heat distribution of the confidence map, and the point groups that do not conform to the spatiotemporal condition are filtered out. Finally, a certain number of discrete points is obtained in each view to represent the position of the right elbow.

Likelihood distribution

Figure 6 shows the confidence decreasing gradually from left to right (color toward blue represents high confidence, and toward red represents low confidence). Accordingly, the discrete probability points selected from confidence maps of decreasing intensity become gradually sparser. This selection increases the influence of high-intensity confidence maps while weakening the influence of low-intensity ones. The probability point groups exert their influence in the subsequent processing through their quantity.
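A minimal sketch of intensity-dependent point selection is given below. The text does not state the mapping from intensity to point count, so the thresholds (0.4 and 0.7) are assumptions chosen for illustration, as are the function names and the synthetic Gaussian heatmap; the counts (1, 3, 5) follow the example counts used in the figures.

```python
import numpy as np

def num_points_for_peak(peak, thresholds=(0.4, 0.7), counts=(1, 3, 5)):
    """Map a peak confidence to a candidate count (thresholds are assumed)."""
    low, high = thresholds
    if peak >= high:
        return counts[2]
    if peak >= low:
        return counts[1]
    return counts[0]

def select_points(heatmap):
    """Return the k strongest pixels as (u, v, score), with k set by the peak."""
    k = num_points_for_peak(float(heatmap.max()))
    flat = np.argsort(heatmap, axis=None)[::-1][:k]    # indices of the top-k values
    rows, cols = np.unravel_index(flat, heatmap.shape)
    return [(int(c), int(r), float(heatmap[r, c])) for r, c in zip(rows, cols)]

def gaussian_map(peak, size=21, sigma=3.0):
    """Synthetic confidence map with one peak at the center (illustration only)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    return peak * np.exp(-(xx**2 + yy**2) / (2 * sigma**2))

strong = select_points(gaussian_map(0.9))   # high intensity: more points
weak = select_points(gaussian_map(0.2))     # low intensity: a single point
print(len(strong), len(weak))
```

Because later stages weight a group by how many points it contains, choosing more points from a strong map is what gives that map more influence.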

Fig. 7

Joint numbering in the human model

Fig. 8

Temporal smoothness-based discrete probability points selection. The figure shows a continuous 2D movement of a joint in one viewpoint (the movement sequence runs from right to left). After the previous operation, the position of the joint at each moment in this perspective appears as a 2D point group. Eight moments are chosen as an example, three of which contain two point groups. The relationship between point groups is divided into three cases. In this example, there are two instances of case 1 (the orange and green point groups), two of case 2 (the pink and golden point groups), and one of case 3 (the blue point group). Because temporal smoothness is used, the processing of the current point group refers to the intensity characteristics of the previous one

Temporal smoothness

Because of the confusing limbs of the figure skater and the unclear target caused by the large venue, identification results are often ambiguous. In some cases, a joint has more than one part confidence map in a given perspective and therefore corresponds to two discrete point groups. With the joint numbering shown in Fig. 7, this situation generally occurs for body parts that have a left and a right counterpart (e.g., the neck does not cause confusion, but the hands do). To solve this problem, this work proposes a 2D temporal filter based on the smoothness of the body part trajectory.

As shown in Fig. 8, after the likelihood distribution step, the confidence maps of a series of consecutive frames taken from one perspective are first presented in the form of discrete probability points. When ambiguous confidence maps occur, the data of the previous frame are taken into consideration. Temporal smoothing aims to make the point group unique.

For each frame, the previous frame has already been processed and is therefore always unique. Its highest-score point is taken as the reference:

$$\begin{aligned} \mathrm{joint}^{k_n}_{\mathrm{frame}_{t-1}}=\left( u_{\mathrm{frame}_{t-1}}^n,v_{\mathrm{frame}_{t-1}}^n,s_{\mathrm{frame}_{t-1}}^n\right) , \end{aligned}$$

where k is the joint number, following the rules in Fig. 7, u and v are the pixel coordinates, s is the confidence score, and n denotes the nth point in the current point group. The highest-score points in the current frame’s point groups are defined as:

$$\begin{aligned} \mathrm{joint}^{k_n}_{\mathrm{frame}_{t_0}}&=\left( u_{\mathrm{frame}_{t_0}}^n,v_{\mathrm{frame}_{t_0}}^n,s_{\mathrm{frame}_{t_0}}^n\right) \end{aligned}$$
$$\begin{aligned} \mathrm{joint}^{k_n}_{\mathrm{frame}_{t_1}}&=\left( u_{\mathrm{frame}_{t_1}}^n,v_{\mathrm{frame}_{t_1}}^n,s_{\mathrm{frame}_{t_1}}^n\right) . \end{aligned}$$

where \(\mathrm{joint}^{k_n}_{\mathrm{frame}_{t_1}}\) only exists when there are ambiguous results. If there are no double point groups, the highest-score point in the current frame’s point group is denoted \(\mathrm{joint}^{k_n}_{\mathrm{frame}_t}\).

There are three cases in which the previous frame is used to handle the current frame. The first case retains the closer point group with its original intensity. Here the previous frame is a high-intensity point group, which means that its error probability is relatively low. Of \(\mathrm{joint}^{k_n}_{\mathrm{frame}_{t_0}}\) and \(\mathrm{joint}^{k_n}_{\mathrm{frame}_{t_1}}\), the one closer to the reference point group is more consistent with continuity and is taken as the valid result \(\mathrm{joint}^{k_n}_{\mathrm{frame}_t}\). Therefore, the final point group of this frame is

$$\begin{aligned} \mathrm{joint}^{k_n}_{\mathrm{frame}_t}=\left( u_{\mathrm{frame}_t}^n,v_{\mathrm{frame}_t}^n,s_{\mathrm{frame}_t}^n\right) . \end{aligned}$$

The second case reduces the intensity of the current frame’s point group. Here the previous frame is a low-intensity point group, the current frame does not contain double point groups, and, more importantly, the continuity between the two frames is out of bounds. When the previous frame cannot provide a reliable guarantee and the relationship between the two frames is not continuous, the point group of the current frame should be downgraded as much as possible. Therefore, the final point group of this frame is

$$\begin{aligned} \mathrm{joint}^{k_n}_{\mathrm{frame}_t}=\Bigg (u_{\mathrm{frame}_t}^n,v_{\mathrm{frame}_t}^n,0.100\Bigg ). \end{aligned}$$

The third case retains the closer point group and reduces its intensity. It handles the situation in which the current frame contains double point groups and the previous frame is of low intensity. Because the continuity between frames is not strong, the point group closer to the previous frame is forcibly selected from \(\mathrm{joint}^{k_n}_{\mathrm{frame}_{t_0}}\) and \(\mathrm{joint}^{k_n}_{\mathrm{frame}_{t_1}}\). The selected group \(\mathrm{joint}^{k_n}_{\mathrm{frame}_t}\) is then demoted to reduce its influence on the overall result. The final point group of this frame is

$$\begin{aligned} \mathrm{joint}^{k_n}_{\mathrm{frame}_t}=\Bigg (u_{\mathrm{frame}_t}^n,v_{\mathrm{frame}_t}^n,0.100\Bigg ). \end{aligned}$$
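The three cases can be summarized in a short sketch that operates on the reference (highest-score) point of each group. The demoted score 0.100 comes from the equations above; the high-intensity threshold `HIGH_INTENSITY` and the continuity bound `MAX_JUMP_PX` are not given in the text and are assumed values, and the function names are hypothetical.

```python
import math

LOW_SCORE = 0.100       # demoted confidence, as in the equations above
HIGH_INTENSITY = 0.5    # assumed threshold for a high-intensity previous frame
MAX_JUMP_PX = 40.0      # assumed continuity bound between consecutive frames

def dist(a, b):
    """Pixel distance between two (u, v, s) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def temporal_filter(prev, candidates):
    """Pick one (u, v, s) point for the current frame.

    prev: reference point of the already processed previous frame.
    candidates: one reference point per point group in the current frame
                (two entries when the detection is ambiguous).
    """
    closest = min(candidates, key=lambda c: dist(c, prev))
    if prev[2] >= HIGH_INTENSITY:
        # Case 1: reliable previous frame, keep the closer group unchanged.
        return closest
    if len(candidates) > 1:
        # Case 3: weak previous frame and double groups, keep closer but demote.
        return (closest[0], closest[1], LOW_SCORE)
    if dist(closest, prev) > MAX_JUMP_PX:
        # Case 2: weak previous frame and discontinuous motion, demote.
        return (closest[0], closest[1], LOW_SCORE)
    return closest

double_groups = [(105, 102, 0.8), (300, 120, 0.85)]
case1 = temporal_filter((100, 100, 0.9), double_groups)
case2 = temporal_filter((100, 100, 0.2), [(250, 100, 0.7)])
case3 = temporal_filter((100, 100, 0.2), double_groups)
print(case1, case2, case3)
```

Demoting rather than discarding keeps a candidate available for the later 3D stage while limiting its weight.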

Multi-perspective and combination unification-based large-scale venue 3D reconstruction

Binocular stereo vision mimics human eyes to obtain 3D information and consists of two cameras. The two cameras form a triangular relationship with the measured object in space. As shown in Fig. 9, the spatial coordinate can be obtained according to the calibration matrix and the pixel values of the two camera planes.

Fig. 9

Binocular stereo vision structure. First, camera calibration theory is used to calculate the relationship between a space point and the corresponding pixel point. The binocular setup can then estimate the depth value of the space point. Based on the camera calibration matrices and the 2D pixel projections of the space point in the two viewing angles, the 3D point coordinates in space can be calculated
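As an illustration of how a 3D point follows from two calibration matrices and two pixel positions, the sketch below uses standard linear (DLT) triangulation. It is not necessarily the reconstruction routine used in this work; the intrinsic matrix, the camera offset and the test point are hypothetical values.

```python
import numpy as np

def triangulate(P_left, P_right, uv_left, uv_right):
    """Linear (DLT) triangulation of one 3D point from two views.

    P_left, P_right: 3x4 projection matrices from camera calibration.
    uv_left, uv_right: (u, v) pixel coordinates of the same joint.
    """
    rows = []
    for P, (u, v) in ((P_left, uv_left), (P_right, uv_right)):
        rows.append(u * P[2] - P[0])   # constraint from the u coordinate
        rows.append(v * P[2] - P[1])   # constraint from the v coordinate
    A = np.stack(rows)
    # The homogeneous solution is the right singular vector that belongs
    # to the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point to pixel coordinates with projection matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Hypothetical calibration: shared intrinsics, second camera shifted along x.
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
P_left = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P_right = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.5, 0.2, 5.0])
X_est = triangulate(P_left, P_right,
                    project(P_left, X_true), project(P_right, X_true))
print(X_est)
```

When the two viewing rays are nearly parallel, the linear system becomes ill-conditioned and small pixel errors produce large depth errors.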


This work uses six cameras to capture the 3D pose, because the blind spots that a single view suffers during the athlete’s 360-degree rotation have to be taken into account. As shown in Fig. 10, the graduated color bar represents different confidence intensities, which weaken from left to right.

Fig. 10

The reconstructed spatial confidence point group. The quantity of the points in spatial confidence point group depends on the amount of the selected discrete probability points in confidence map. For example, in the gray picture on the left, the first perspective contains 5 discrete probability points and the second perspective contains 3 discrete probability points. Then, the number of the points in this spatial confidence point group is \(3 \times 5\)

Fig. 11

Binocular reconstruction requires an obvious angle. When there is no obvious triangular relationship between the two perspectives, the reconstruction results are biased

Fig. 12

The reconstructed trajectories from multi-view. The 3D trajectories under different combinations are shown here. It can be seen that the shapes of the trajectories are roughly similar, but there is a certain degree of position offset

Recall that proposal one provides each discrete point in a 2D point group as \(\mathrm{joint}^{k_n}_{\mathrm{frame}_t}=(u_{\mathrm{frame}_t}^n,v_{\mathrm{frame}_t}^n,s_{\mathrm{frame}_t}^n)\). After reconstruction from multiple perspectives, each discrete point in a spatial confidence point group is

$$\begin{aligned} \begin{aligned} {\left\langle L,R \right\rangle } \mathrm{joint}^{k_m}_{\mathrm{frame}_t}&=\Bigg ({\left\langle L,R \right\rangle }x_{\mathrm{frame}_t}^{k_m},{\left\langle L,R \right\rangle }y_{\mathrm{frame}_t}^{k_m},\\&\qquad {\left\langle L,R \right\rangle }z_{\mathrm{frame}_t}^{k_m},{\left\langle L,R \right\rangle }s_{\mathrm{frame}_t}^{k_m}\Bigg ) \end{aligned} \end{aligned}$$

L and R indicate which two viewpoints are selected; their values range from camera 0 to camera 5. m denotes the mth 3D point in the spatial confidence point group. \({\left\langle L,R \right\rangle }s_{\mathrm{frame}_t}^{k_m}\) is the average confidence value of the selected 2D discrete points from the two cameras.

Since six cameras are set every 60 degrees and the recovery of 3D shape is mainly based on binocular stereo reconstruction, the six-camera setting yields 15 combinations (\({\left\langle 0,1 \right\rangle }\), \({\left\langle 0,2 \right\rangle }\), \({\left\langle 0,3 \right\rangle }\), \({\left\langle 0,4 \right\rangle }\), \({\left\langle 0,5 \right\rangle }\), \({\left\langle 1,2 \right\rangle }\), \({\left\langle 1,3 \right\rangle }\), \({\left\langle 1,4 \right\rangle }\), \({\left\langle 1,5 \right\rangle }\), \({\left\langle 2,3 \right\rangle }\), \({\left\langle 2,4 \right\rangle }\), \({\left\langle 2,5 \right\rangle }\), \({\left\langle 3,4 \right\rangle }\), \({\left\langle 3,5 \right\rangle }\), \({\left\langle 4,5 \right\rangle }\)). Because of the large area of the figure skating venue and the complexity of the camera reconstruction results, an appropriate reconstruction combination has to be chosen, as illustrated in Fig. 11. When the athlete moves to certain positions, the two cameras and the target do not form an obvious angle, which biases the reconstruction.

Combinations \({\left\langle 0,1 \right\rangle }\), \({\left\langle 2,5 \right\rangle }\) and \({\left\langle 3,4 \right\rangle }\) can be judged unusable directly, because their angle with the target is always close to 180 degrees, which means that the corresponding rays of these combinations are nearly parallel. The remaining combinations are considered valuable. However, subtle differences still exist among them, as shown in Fig. 12. The next step is to unify the trajectories based on the reconstructed spatial confidence point groups.
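The count of camera pairs can be checked with a short enumeration. The variable names are illustrative; the excluded pairs are the three named in the text.

```python
from itertools import combinations

cameras = range(6)                            # camera 0 to camera 5
all_pairs = list(combinations(cameras, 2))    # every unordered <L,R> pair
unusable = {(0, 1), (2, 5), (3, 4)}           # pairs judged unusable in the text
usable_pairs = [p for p in all_pairs if p not in unusable]
print(len(all_pairs), len(usable_pairs))
```

Six cameras give 15 pairs, and removing the three unusable ones leaves 12 combinations for reconstruction.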

Camera combinations unification

Each joint has several spatial confidence point groups with different intensities. To unify these groups, we select a root joint in advance and calculate its position first. The spatial confidence point groups of the other, common joints are then translated based on the root joint.

To unify the combination values, this work selects as the root joint the neck joint of combination \({\left\langle 2,4 \right\rangle }\) or \({\left\langle 3,5 \right\rangle }\), in which there is no occlusion problem and the spatial confidence point group is always strong. Taking combination \({\left\langle 2,4 \right\rangle }\) as the example:

$$\begin{aligned} \begin{aligned} {\left\langle 2,4 \right\rangle }\mathrm{joint}^1_{\mathrm{frame}_t}&= \left( \dfrac{\sum ^{m}_{r=1}{\left\langle 2,4 \right\rangle }x_{\mathrm{frame}_t}^{1_r}}{m},\right. \\&\qquad \dfrac{\sum ^{m}_{r=1}{\left\langle 2,4 \right\rangle }y_{\mathrm{frame}_t}^{1_r}}{m},\\&\left. \qquad \dfrac{\sum ^{m}_{r=1}{\left\langle 2,4 \right\rangle }z_{\mathrm{frame}_t}^{1_r}}{m}\right) . \end{aligned} \end{aligned}$$

The three components in brackets are recorded as \(x_{\mathrm{root}_t}\), \(y_{\mathrm{root}_t}\), \(z_{\mathrm{root}_t}\). Then, every 3D point in different camera combinations’ spatial confidence point group belonging to one certain joint k needs to be reset as \(S^{k_m}_{\mathrm{frame}_t}\) according to the following formula:

$$\begin{aligned} S^{k_m}_{\mathrm{frame}_t}= \left\{ \begin{array}{lr} {\left\langle L,R \right\rangle }x_{\mathrm{frame}_t}^{k_m}-{\left\langle L,R \right\rangle }x_{\mathrm{frame}_t}^{1_m}+x_{\mathrm{root}_t},\\ {\left\langle L,R \right\rangle }y_{\mathrm{frame}_t}^{k_m}-{\left\langle L,R \right\rangle }y_{\mathrm{frame}_t}^{1_m}+y_{\mathrm{root}_t},\\ {\left\langle L,R \right\rangle }z_{\mathrm{frame}_t}^{k_m}-{\left\langle L,R \right\rangle }z_{\mathrm{frame}_t}^{1_m}+z_{\mathrm{root}_t} \end{array} \right. \end{aligned}$$

At this point, the results from all available camera combinations have been unified, as shown in Fig. 13.
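The translation onto the root joint can be sketched as follows. One simplification is assumed: each camera pair's own neck position is taken as the mean of its neck point group, whereas the formula above indexes neck points individually. The function names and the coordinates are hypothetical.

```python
import numpy as np

def group_mean(points):
    """Mean 3D position of a spatial point group."""
    return np.asarray(points, dtype=float).mean(axis=0)

def unify(joint_points, neck_points, root):
    """Shift one camera pair's joint group so that its neck lands on the root.

    joint_points: 3D points of a common joint from camera pair <L,R>.
    neck_points:  3D neck points reconstructed by the same camera pair.
    root:         root position, e.g. the mean neck of pair <2,4>.
    """
    offset = np.asarray(root, dtype=float) - group_mean(neck_points)
    return np.asarray(joint_points, dtype=float) + offset

# Hypothetical data in metres.
neck_24 = [[1.00, 2.00, 1.50], [1.02, 1.98, 1.50]]
root = group_mean(neck_24)

neck_13 = [[1.10, 1.95, 1.52]]                        # same neck seen by pair <1,3>
elbow_13 = [[1.40, 1.95, 1.22], [1.42, 1.93, 1.20]]   # elbow seen by pair <1,3>
elbow_unified = unify(elbow_13, neck_13, root)
print(elbow_unified)
```

The shift preserves each point's position relative to its own pair's neck, so groups from different pairs end up in one common area around the root.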

Fig. 13

The camera combinations are unified. The neck joint is considered the root joint. A common joint here is any joint other than the neck. Each common joint has several spatial confidence point groups from different camera combinations. According to the relative distance between it and the root joint of a specific camera combination, all the point groups are transformed into a relatively concentrated area. This is consistent with the fact that the 3D position of the joint should be unchanged no matter which two perspectives are used

Multi-constraint-based human skeleton estimation

After the first two proposed methods, one or more spatial confidence point groups belonging to each joint can be obtained. For the purpose of determining one unique joint 3D position, this work attempts to specify a reasonable and valuable mechanism by setting out constraints to choose the final spatial 3D point.

Priors based on body structure and motion trend

Which priors to use and how to obtain them are prerequisites for this proposal. Here we mainly discuss the roles of bone length and motion trend angle in the constraint superposition process.

To keep the human body realistic, the length prior can be obtained by measuring the athlete’s body. However, when the reconstruction accuracy shows no excessive deviation, a prior that does not differ from manual measurement can be obtained by selecting joint values after supervision. This is more versatile, because it places no limitation on the athlete. The motion trend of figure skaters is mentioned here to motivate the subsequent use of temporal information. The continuity of an action closely links consecutive frames, and the stability of the motion trend provides a shortcut for exploiting this inter-frame connection. The method makes finer corrections based on the slight change of limb angle between frames. In summary, the length of the human skeleton and the continuity of human motion are used to constrain the calculation results.

Fig. 14

Output form of the joints. For each joint, it is expected to output a 3D point in space. These joints will form a human skeletal structure, and multiple frames will form the jumping poses of the figure skater


After the reconstruction of spatial confidence point groups, each joint has multiple candidate groups. The system aims to reduce all point groups to specific spatial locations, as shown in Fig. 14.

As shown in Fig. 15, length constraint \(l^{p\_{c_m}}_{\mathrm{frame}_t}\), angle constraint \(\theta ^{p\_{c_m}}_{\mathrm{frame}_t}\) and confidence constraint \(\mathrm{score}_{\mathrm{frame}_t}^{c_m}\) are added to filter out errors and select the most accurate 3D position:

$$\begin{aligned} \left\{ \begin{array}{ll} l^{p\_{c_m}}_{\mathrm{frame}_t},\\ \theta ^{p\_{c_m}}_{\mathrm{frame}_t},\\ \mathrm{score}_{\mathrm{frame}_t}^{c_m}, \end{array} \right. \end{aligned}$$

Fig. 15


First, it is decided whether a spatial confidence point group is eligible to participate in weight assignment. Starting with a length filter that removes points exceeding the limit helps eliminate a wide range of errors. Let \(\mathrm{flag}_{p\_{c_m}}\) denote the prior bone length between the parent joint and the child joint, and let \(S^{c_m}_{\mathrm{frame}_t}\) denote the spatial coordinate and the corresponding confidence score of a point in the spatial confidence point groups of joint c:

$$\begin{aligned} S^{c_m}_{\mathrm{frame}_t}=\left( x_{\mathrm{frame}_t}^{c_m},y_{\mathrm{frame}_t}^{c_m},z_{\mathrm{frame}_t}^{c_m},s_{\mathrm{frame}_t}^{c_m}\right) . \end{aligned}$$

The length distance of the parent and child joints is

$$\begin{aligned} \begin{aligned} (l^{p\_{c_m}}_{\mathrm{frame}_t})^{2}&= \left( x_{\mathrm{frame}_t}^{c_m}-x_{\mathrm{frame}_t}^{p_m}\right) ^{2}\\&\quad +\left( y_{\mathrm{frame}_t}^{c_m}-y_{\mathrm{frame}_t}^{p_m}\right) ^{2}\\&\quad +\left( z_{\mathrm{frame}_t}^{c_m}-z_{\mathrm{frame}_t}^{p_m}\right) ^{2}. \end{aligned} \end{aligned}$$

If \(l^{p\_{c_m}}_{\mathrm{frame}_t}\) lies within \(\mathrm{flag}_{p\_{c_m}}\pm 50.00\) centimeters, the point enters the screening of the next constraint; otherwise it is discarded directly.
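The length screening can be sketched as a simple filter. The tolerance of 50 cm is the value stated above; the function names, the prior length of 30 cm and the sample coordinates are hypothetical.

```python
import math

LENGTH_TOLERANCE_CM = 50.0   # tolerance stated in the text

def bone_length(child, parent):
    """Euclidean distance between the 3D positions of child and parent."""
    return math.dist(child[:3], parent[:3])

def length_filter(candidates, parent, prior_length_cm):
    """Keep the (x, y, z, s) candidates whose bone length is near the prior."""
    return [c for c in candidates
            if abs(bone_length(c, parent) - prior_length_cm) <= LENGTH_TOLERANCE_CM]

# Hypothetical parent joint and child candidates, in centimeters.
parent = (0.0, 0.0, 150.0)
candidates = [
    (0.0, 28.0, 150.0, 0.8),    # length 28: kept
    (0.0, 200.0, 150.0, 0.9),   # length 200: discarded despite its high score
    (10.0, 60.0, 150.0, 0.5),   # length about 60.8: kept
]
kept = length_filter(candidates, parent, prior_length_cm=30.0)
print(kept)
```

Running this filter first removes gross reconstruction errors before the angle and confidence constraints are evaluated.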

Second, we consider the changes in limb angle and confidence score in parallel to locate the final position of the joint. The angle between the current frame’s limb and the previous frame’s limb is

$$\begin{aligned} \theta ^{p\_{c_m}}_{\mathrm{frame}_t}=\arccos \frac{\overrightarrow{v^{p\_{c_m}}_{\mathrm{frame}_t}}\cdot \overrightarrow{v^{p\_{c_m}}_{\mathrm{frame}_{t-1}}}}{\left|\overrightarrow{v^{p\_{c_m}}_{\mathrm{frame}_t}}\right| \left|\overrightarrow{v^{p\_{c_m}}_{\mathrm{frame}_{t-1}}}\right|}, \end{aligned}$$

where \(\overrightarrow{v^{p\_{c_m}}_{\mathrm{frame}_t}}\) is the vector of the parent and child joints:

$$\begin{aligned} \overrightarrow{v^{p\_{c_m}}_{\mathrm{frame}_t}}&=\Bigg ( x^{c_m}_{\mathrm{frame}_t}-x^{p}_{\mathrm{frame}_t},\ y^{c_m}_{\mathrm{frame}_t}-y^{p}_{\mathrm{frame}_t},\\&\qquad z^{c_m}_{\mathrm{frame}_t}-z^{p}_{\mathrm{frame}_t} \Bigg ). \end{aligned}$$
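The limb angle between consecutive frames can be computed as in the sketch below. The function name and the sample coordinates are hypothetical; clipping the cosine is an added safeguard against floating-point rounding and is not part of the formula above. Zero-length limbs are not handled.

```python
import math
import numpy as np

def limb_angle(child_t, parent_t, child_prev, parent_prev):
    """Angle in radians between the limb vectors of two consecutive frames."""
    v_t = np.asarray(child_t, dtype=float) - np.asarray(parent_t, dtype=float)
    v_prev = np.asarray(child_prev, dtype=float) - np.asarray(parent_prev, dtype=float)
    cos_theta = np.dot(v_t, v_prev) / (np.linalg.norm(v_t) * np.linalg.norm(v_prev))
    # Clip to the valid domain of arccos to absorb rounding errors.
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

# A limb that turns from the y axis to the x axis rotates by 90 degrees.
quarter_turn = limb_angle((1, 0, 0), (0, 0, 0), (0, 1, 0), (0, 0, 0))
# A limb that keeps its direction gives an angle of zero.
no_turn = limb_angle((2, 0, 0), (0, 0, 0), (1, 0, 0), (0, 0, 0))
print(math.degrees(quarter_turn), no_turn)
```

A small angle indicates that a candidate continues the motion trend of the previous frame.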

Third, different weights are assigned to the different intensities in the spatial confidence point groups and denoted \(\mathrm{score}_{\mathrm{frame}_t}^{c_m}\). The angle constraint \(\theta ^{p\_{c_m}}_{\mathrm{frame}_t}\) and the confidence constraint \(\mathrm{score}_{\mathrm{frame}_t}^{c_m}\) take effect together. The tolerance of the angle expands gradually in a divergent way, searching for high-confidence points \(S^{c_n}_{\mathrm{frame}_t}\) in the spatial confidence point groups within the divergence range to jointly calculate the final spatial position. Here, n denotes the number of points with the same score belonging to one joint; it differs from m in that the n points are a subset of the m points. Similar to a weighted average, the final spatial coordinate \(C^c_{\mathrm{frame}_t}\) is

$$\begin{aligned} C^c_{\mathrm{frame}_t}= \left\{ \begin{array}{ll} \dfrac{\sum \Bigg ({\mathrm{score}_{\mathrm{frame}_t}^{c_n}\cdot x_{\mathrm{frame}_t}^{c_n}}\Bigg )}{\sum \Bigg ({\mathrm{score}_{\mathrm{frame}_t}^{c_n}\cdot n}\Bigg )},\\ \dfrac{\sum \Bigg ({\mathrm{score}_{\mathrm{frame}_t}^{c_n}\cdot y_{\mathrm{frame}_t}^{c_n}}\Bigg )}{\sum \Bigg ({\mathrm{score}_{\mathrm{frame}_t}^{c_n}\cdot n}\Bigg )},\\ \dfrac{\sum \Bigg ({\mathrm{score}_{\mathrm{frame}_t}^{c_n}\cdot z_{\mathrm{frame}_t}^{c_n}}\Bigg )}{\sum \Bigg ({\mathrm{score}_{\mathrm{frame}_t}^{c_n}\cdot n}\Bigg )},\\ \end{array} \right. \end{aligned}$$

Experiment result

Data set and experimental environment

The source videos used in the experiments record the figure skating scene with six cameras. In theory, the more cameras are used, the better the algorithm performs and the more computation time is required. To decide the number of cameras, we ran a verification test with different camera numbers and concluded that at least six cameras should be used to ensure the accuracy of the algorithm. Details of the verification test are given in the following subsection.

For each camera, the resolution is \(1920 \times 1080\), the frame rate is 60 fps and the shutter speed is 0.001 s, so that there is little motion blur in the images. The test sequences contain single, double and triple Flip jumps.

The experiments are run in the following environment: the CPU is an Intel Core i7-3770, the RAM is 8 GB, the compiler is Visual Studio 2017, and the external library is OpenCV 3.4.1.

Verification test of camera number

Due to equipment limitations, this article only reports experiments in which the number of cameras is six or fewer. To obtain a precise 3D human pose, at least two cameras are required.

In figure skating, the spin pose of the body is important for analyzing the jump action, so the experiment simulates a spin. As Fig. 16 shows, the spin angle is divided into 12 states.

Fig. 16
figure 16

In the camera-number experiments, a camera is placed every 60 degrees

Taking camera 1 as an example, the joint recognition of the athlete at each angle is shown in Fig. 17, where the joint numbers are those marked in Fig. 7.

Fig. 17
figure 17

The observation of joints at each angle in camera 1

As Fig. 17 shows, when the athlete is at position \(\textcircled {1}\), facing the camera, all the joints are clearly visible and no joint is misidentified. When the athlete turns to position \(\textcircled {3}\), he is essentially side-on to the camera, so his right arm is partly self-occluded and the right shoulder, right elbow and right wrist (joints 2, 3 and 4, respectively) are identified incorrectly. When the athlete turns to position \(\textcircled {6}\), his back is to the camera. Both of his arms are then self-occluded, so the left and right wrist joints (joints 4 and 7) are identified incorrectly. The remaining positions are similar. The observation results for all six cameras are summarized in Fig. 18. The six cameras are named C1 to C6, and each column represents a spin angle of the athlete. The element (C1, \(\textcircled {3}\)) represents that only joints 2, 3 and 4 can be observed in camera C1 at position \(\textcircled {3}\). In the first column, the value 4 (2) marked by an arrow means that joint 4 can be correctly observed by only 2 cameras at position \(\textcircled {1}\).

From Fig. 18, it can be concluded that even when six cameras are used, some joints at specific positions can be observed by only two cameras. Since at least two cameras are required for 3D reconstruction, the number of cameras cannot be less than six if the accuracy of the algorithm is to be ensured.
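The coverage argument can be illustrated with a small Python sketch that counts, for every joint and every position, how many cameras observe the joint correctly and checks that the count never falls below two. The data below are a purely hypothetical toy example with three cameras, three joints and two positions; they are not the values of Fig. 18, and all names are illustrative.

```python
def coverage(visible, joints, positions):
    """Number of cameras that correctly observe each joint at each position.

    visible: {camera: {position: set of correctly observed joints}}
    """
    return {
        (joint, pos): sum(
            1 for cam in visible if joint in visible[cam].get(pos, set())
        )
        for joint in joints
        for pos in positions
    }


def is_sufficient(visible, joints, positions, min_views=2):
    """True if every joint is seen by at least min_views cameras everywhere."""
    return all(n >= min_views for n in coverage(visible, joints, positions).values())


def without_camera(visible, camera):
    """Copy of the visibility table with one camera removed."""
    return {cam: obs for cam, obs in visible.items() if cam != camera}


# Hypothetical toy data (NOT taken from Fig. 18).
example_visible = {
    "C1": {1: {0, 1, 2}, 2: {0, 1}},
    "C2": {1: {0, 1, 2}, 2: {0, 2}},
    "C3": {1: {0, 1}, 2: {1, 2}},
}

if __name__ == "__main__":
    joints, positions = [0, 1, 2], [1, 2]
    print(is_sufficient(example_visible, joints, positions))
    for cam in example_visible:
        reduced = without_camera(example_visible, cam)
        print(cam, is_sufficient(reduced, joints, positions))
```

In this toy setting every joint is seen by at least two cameras, but removing any single camera leaves some joint with only one view, which mirrors the reasoning used above for the six-camera setup.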

Evaluation method

This work adopts the per-joint position error in millimeters as the evaluation protocol. As shown in Fig. 19, we first manually label the human joint pixels from six perspectives; the manually labeled coordinates are then reconstructed and statistically processed to obtain the ground truth.

Fig. 18
figure 18

Summary of the observation condition of each joint in the different cameras. The element (Cx, \(\textcircled {y}\)) represents the joints which can be observed by camera x at position \(\textcircled {y}\)

Because the figure skating venue is large and the target is small, slight fluctuations at the pixel level cause centimeter-level errors in the real world. Therefore, in terms of error tolerance, not only the error range but also the ground-truth error caused by manual annotation needs to be considered. For this reason, this paper evaluates the results with both qualitative and quantitative methods.

First is the qualitative evaluation method. In this work, the ground truth is taken as the center of a sphere whose radius is the error range. A result is defined as successful if the distance between the calculated coordinate and the center of the sphere does not exceed the error range. The success rate is defined as the number of successful joints divided by the total number of joints:

$$\begin{aligned} \mathrm{Success\ rate}= \frac{\mathrm{Successful\ joints}}{\mathrm{Total\ joints}}. \end{aligned}$$

As for the quantitative evaluation method, the mean per joint position error (MPJPE) is used. The error is calculated as

$$\begin{aligned} E_{\mathrm{MPJPE}} = \frac{1}{N_j} \sum ^{N_j}_{i=1}\left| S_{\mathrm{GT}}^{i} - S_{\mathrm{result}}^{i} \right| \end{aligned}$$

where \(S_{\mathrm{GT}}^{i}\) and \(S_{\mathrm{result}}^{i}\) are the ground truth and the estimated result of the \(i\)th joint, and \(N_j\) is the total number of human body joints, whose value is 13.
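Both measures can be computed from the same per-joint errors, as the following Python sketch shows. It assumes that the bars in the MPJPE formula denote the Euclidean distance between two 3D points and that a joint exactly on the boundary of the error sphere counts as successful; the function names and the sample coordinates are illustrative only.

```python
import math


def joint_errors(ground_truth, estimate):
    """Euclidean distance between each ground-truth joint and its estimate."""
    if len(ground_truth) != len(estimate):
        raise ValueError("both poses must contain the same number of joints")
    return [math.dist(g, e) for g, e in zip(ground_truth, estimate)]


def mpjpe(ground_truth, estimate):
    """Mean per joint position error, in the unit of the input coordinates."""
    errors = joint_errors(ground_truth, estimate)
    return sum(errors) / len(errors)


def success_rate(ground_truth, estimate, error_range):
    """Fraction of joints whose error does not exceed error_range."""
    errors = joint_errors(ground_truth, estimate)
    return sum(1 for e in errors if e <= error_range) / len(errors)


if __name__ == "__main__":
    # Hypothetical two-joint pose in millimeters.
    gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
    est = [(3.0, 4.0, 0.0), (0.0, 0.0, 100.0)]
    print(mpjpe(gt, est), success_rate(gt, est, error_range=30.0))
```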

Experimental results

Comparison items are shown in Table 1. The proposed method is compared with the basic framework and the conventional work. Although all comparison items share the same three modules (2D information extraction, reconstruction and 3D pose correction), each of them uses a different strategy in each module. The strategies used in the proposed method have already been explained in full. The methods employed in the basic framework and the conventional work can be described briefly as follows. In the 2D information extraction module, the basic framework directly uses the joint pixel coordinates recognized by the existing 2D pose estimator. The conventional work adds temporal smoothness on top of this to filter out wrong points. In the 3D reconstruction module, the basic framework and the conventional work both take the average of the reconstruction results from multiple cameras as the fusion result. Both ignore the deviation between the 3D points reconstructed from different cameras. In the 3D correction module, the basic framework does not modify the 3D skeleton, and the conventional work only modifies unreasonable bone lengths, using the physiological human joint length as a prior. In the experiment part, only the end-to-end results are presented. This is because the three proposals are not independent of each other and no module can be replaced within the whole framework. Therefore, it is difficult to conduct ablation experiments that demonstrate the improvement provided by each individual proposal.

The experimental results are shown in Table 2. With the joint numbering of Fig. 7, Upper represents the joints of the upper body (joint numbers 0–7), and Lower represents the joints of the lower body (joint numbers 8–13). The table lists the success rates of the upper-body, lower-body and whole-body joints within the specified error range. Compared with the conventional work, the success rates of the proposed work are almost all above 90% and the MPJPE is significantly reduced from 74.12 to 23.57 mm. This is due to the use of spatial confidence point groups to determine the possible positions of the joints, after which the human skeleton is generated in combination with temporal constraints. At the 2D level, the conventional work takes the highest-confidence point of the part confidence map as the 2D position of a joint, while the proposed method uses discrete probability points instead of a single highest-probability point, which can avoid 2D recognition errors. In the reconstruction part, the conventional work takes the average of the reconstructions from different camera combinations over the six perspectives as the final 3D result, while the proposed method generates a spatial confidence point group and unifies all combinations based on their relative relationship with the root joint. In terms of 3D constraints, the proposed method adds motion trend constraints and confidence constraints to the human bone length constraints used by the conventional work.

Fig. 19
figure 19

Evaluation method

Table 1 Comparison items
Table 2 Experiment results

Consideration and analysis

The method proposed in this paper still leaves some problems unsolved. First of all, the experimental results show that the accuracy is not perfect. The large venue of figure skating is a very influential factor. Although the data set was prepared so as to focus on a local area of the rink as much as possible, the field of view of more than 100 square meters and the fixed camera positions impose many limitations. This is detrimental not only to the accuracy of camera calibration, but also to the accuracy of the binocular reconstruction of a small target.

Besides, because this work is implemented in stages, the accuracy of the 2D estimator plays a crucial role. This work uses the OpenPose network to estimate the 2D pose and the part confidence map. However, there are still many 2D misrecognitions, since OpenPose is weak at estimating abnormal poses, which are typical in figure skating. The 2D misrecognitions of the pose estimation network directly affect the final reconstruction results and limit the upper bound of the proposals’ accuracy. To achieve a more accurate 3D pose, a 2D pose estimation method designed specifically for figure skating is expected.

Moreover, if the system is applied to a real competition, the processing speed cannot be ignored. At present, the focus of this work is to make it possible to extract the 3D pose of figure skaters. In the future, real-time processing that preserves the accuracy will be a necessary condition.

Fig. 20
figure 20

Examples of experiment results


This work focuses on multi-view 3D human pose reconstruction based on synchronized sequences. The proposed method formulates the problem as 2D pose correction followed by 3D pose reconstruction, taking advantage of the particular characteristics of figure skating. The proposal starts with the 2D confidence map and presents a multi-technology correction solution based on the likelihood distribution and temporal information. Then, 3D reconstruction is conducted with full consideration of the large-scale venue. Using multiple cameras to capture the motion of the athlete is more conducive to motion understanding. Finally, refining and narrowing down the spatial confidence point group via multiple constraints allows a human pose prior to be modeled in advance. In comparison to the basic framework and the conventional work [21], the success rate of individual joints is generally above 90%.

At present, OpenPose is used as the 2D human pose estimation network to obtain the confidence maps of the joints. Although OpenPose works well under general conditions, it still produces many 2D misrecognitions for figure skating. To achieve a more accurate 3D pose, a 2D pose estimation method designed specifically for figure skating is expected. In the future, possible ways to improve the accuracy of the 2D pose are to pre-label a batch of skating data as training data and to train a 2D network model that better identifies skating poses. Moreover, adding key points such as hands and feet is also significant. At the same time, improving the algorithm to run in real time while preserving the accuracy is important for real applications. With these modifications, the system is expected to be applied to the objective performance evaluation of figure skaters and to the real-time display of figure skating TV broadcasts.