1 Introduction

Multi-person pose estimation, which localizes the keypoints of all persons in a single RGB image, and multi-person pose tracking are important yet challenging problems. They are fundamental research topics for many visual applications, such as human action recognition [19] and human-computer interaction [6].

Recently, the performance of multi-person pose estimation on standard benchmarks such as MPII Pose [11] and COCO [12] has improved greatly with the rapid development of convolutional neural networks [2, 4, 8, 13, 14, 15, 16, 18, 20]. Existing methods fall into two kinds of approaches: bottom-up and top-down. The bottom-up approach first detects the keypoints of all potential human candidates and then assembles them into the skeleton of each person. The top-down approach first adopts a detection module to obtain all human boxes in the image and then applies a single-person pose estimator to detect the skeleton within each box. Although impressive performance has been achieved, current state-of-the-art methods still have difficulty with occluded keypoints, invisible keypoints, and crowded backgrounds, where keypoints cannot be localized well. Most recent pose tracking methods track the human box over the entire video using either the similarity between pairs of boxes, measured with box IoU, or the similarity between pairs of poses, measured with the keypoint OKS (object keypoint similarity) distance, in adjacent frames [7, 9, 21].

In this work, we propose an efficient and powerful approach to multi-person keypoint detection and tracking in videos. For the keypoint detection stage, we propose an enhanced cascaded pyramid network to accurately locate human keypoints in each frame of a video. For the keypoint tracking stage, we employ an IoU tracker, a lightweight frame-by-frame optimization method that allows our model to scale to videos of virtually any length.

2 Related Work

Our approach is related to previous work on human pose estimation and tracking, described as follows.

Multi-person pose estimation is an important task in computer vision. Existing approaches can be divided into two categories: bottom-up and top-down. Bottom-up approaches first predict all keypoints and then assemble them into multiple persons. For example, associative embedding simultaneously predicts heatmaps and tag maps in order to group the predicted keypoints into different persons [13]. Top-down approaches first detect all human boxes in an image and then predict the keypoints within each box independently. For example, the Cascaded Pyramid Network (CPN) predicts human bounding boxes first and then solves single-person pose estimation on the cropped person patches [4]. In general, top-down approaches are more accurate than bottom-up approaches. However, as the number of people in an image increases, top-down approaches become slower.

Based on the multi-person pose estimation architectures described above, it is natural to extend them from still images to video. Some online trackers simplify the tracking problem to a maximum-weight bipartite matching problem and solve it with a greedy or the Hungarian algorithm. The nodes of this bipartite graph are the human bounding boxes in two adjacent frames. PoseTrack [7] and ArtTrack [9], both from CVPR’17, first introduced the multi-person pose tracking challenge and proposed a new graph partitioning formulation that builds upon 2D DeeperCut [10] by extending the spatial joint graph to a spatio-temporal graph.

3 Method

In this work, we take the top-down approach to estimate multi-person poses in each frame. First, we apply a human detector to the RGB image to generate human bounding boxes. Second, we predict the detailed localization of the keypoints for each candidate human bounding box with a single-person pose estimator. Finally, we simplify the tracking problem to bipartite matching of the candidate bounding boxes between a pair of frames.

3.1 Person Detector

In order to detect as many people as possible in an image, we adopt Deformable Convolutional Networks [5] (detection MAP of 44.4 on the COCO minival dataset) and SNIPER [17] (detection MAP of 46.5 on the COCO minival dataset) to generate our human bounding boxes.

3.2 Pose Estimator

In order to obtain accurate person keypoints, we adopt a state-of-the-art single-person pose estimator, the Cascaded Pyramid Network [4], to detect the human skeletons. In addition, we enhance the cascaded pyramid network to make it more robust and accurate under large pose variations, changes in clothing and lighting conditions, severe body deformations, heavy body occlusions, and so on. For the Global-Net, we design a shuffle unit to exchange information across all feature scales. For the Refine-Net, we design an attention unit to extract more representative features for predicting the keypoint locations.

3.3 Pose Tracker

Following the ICCV 2017 winner [7], the detections are represented as a graph in which every detected person bounding box in every frame is a node. Edges connect each human bounding box in a frame to each human bounding box in the next frame. The cost of each edge is defined as the IoU of the two human bounding boxes it links, which serves as a measure of how likely the two boxes are to belong to the same person. To compute tracks, we simplify the problem to bipartite matching between a pair of frames and propagate the labels forward, one frame at a time, from the first frame to the last.
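The frame-by-frame matching described above can be illustrated with a minimal Python sketch. It is not the implementation used in our experiments: it assumes greedy matching (the Hungarian algorithm could be substituted), boxes given as (x1, y1, x2, y2), and a hypothetical `min_iou` threshold below which a box starts a new track. All function names are illustrative.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_greedy(prev_boxes, curr_boxes, min_iou=0.3):
    """Pair boxes of two adjacent frames greedily by descending IoU.

    Returns a dict {index in curr_boxes: index in prev_boxes}.
    min_iou is an assumed threshold, not a value from the paper.
    """
    pairs = sorted(
        ((box_iou(p, c), i, j)
         for i, p in enumerate(prev_boxes)
         for j, c in enumerate(curr_boxes)),
        reverse=True,
    )
    used_prev, matches = set(), {}
    for iou, i, j in pairs:
        if iou < min_iou:
            break  # remaining pairs overlap even less
        if i in used_prev or j in matches:
            continue  # each box takes part in at most one pair
        used_prev.add(i)
        matches[j] = i
    return matches


def track(frames, min_iou=0.3):
    """Propagate track ids forward, one frame at a time.

    frames: list of per-frame box lists. Returns per-frame lists of ids.
    """
    next_id = 0
    prev_boxes, prev_ids = [], []
    all_ids = []
    for boxes in frames:
        matches = match_greedy(prev_boxes, boxes, min_iou)
        ids = []
        for j in range(len(boxes)):
            if j in matches:
                ids.append(prev_ids[matches[j]])  # inherit the label
            else:
                ids.append(next_id)  # unmatched box starts a new track
                next_id += 1
        all_ids.append(ids)
        prev_boxes, prev_ids = boxes, ids
    return all_ids
```

Because each frame is matched only against its predecessor, the cost per frame depends on the number of boxes in the two frames and not on the length of the video.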

4 Experiments

4.1 Dataset and Evaluation Metric

Our single-person pose estimation model is trained on three datasets: the MSCOCO dataset [12], the AI challenge dataset [3], and the PoseTrack challenge 2018 dataset [1]. The MSCOCO dataset contains over 66k images with 150k people, the AI challenge dataset has more than 270k images with 449k people, and the PoseTrack challenge 2018 dataset contains 667 short video clips annotated for multi-person pose estimation and multi-person pose tracking.

We evaluate our proposed method on the PoseTrack Challenge 2018 dataset. We use Total AP to evaluate the multi-person pose estimation results and the standard MOTA metric to evaluate the tracking performance.
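For reference, the commonly used definition of MOTA is one minus the ratio of all tracking errors (misses, false positives, and identity switches) to the number of ground-truth objects, summed over frames. The sketch below computes it from per-frame counts; the function name and argument layout are illustrative assumptions, and it is not the official PoseTrack evaluation code.

```python
def mota(false_negatives, false_positives, id_switches, num_ground_truth):
    """Multi-Object Tracking Accuracy from per-frame error counts.

    Each argument is a sequence with one count per frame.
    MOTA = 1 - (sum FN + sum FP + sum IDSW) / sum GT
    """
    total_gt = sum(num_ground_truth)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    errors = sum(false_negatives) + sum(false_positives) + sum(id_switches)
    return 1.0 - errors / total_gt
```

Note that MOTA can be negative when the number of errors exceeds the number of ground-truth objects.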

Table 1. The performance of the MAP metric on PoseTrack challenge 2018 dataset.
Table 2. The performance of the MOTA metric on PoseTrack challenge 2018 dataset.

4.2 Training Details

Our single-person pose estimation model is trained using the Adam algorithm with an initial learning rate of 5e-4. We decrease the learning rate by a factor of 2 every 3600000 iterations. We use a weight decay of 1e-5 and a training batch size of 32. The pose estimation model is trained on 4 V100 GPUs on a single GPU server.
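The learning-rate schedule above can be written as a simple step function. The sketch below only restates the stated hyperparameters; the function and parameter names are illustrative, and it is independent of any particular deep learning framework.

```python
def learning_rate(iteration, base_lr=5e-4, decay_every=3_600_000, factor=2.0):
    """Step schedule: divide base_lr by `factor` every `decay_every` iterations."""
    num_decays = iteration // decay_every
    return base_lr / (factor ** num_decays)
```

For example, under this schedule the learning rate is 5e-4 up to iteration 3599999 and 2.5e-4 from iteration 3600000 onward.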

4.3 Testing Details

Following the same testing strategies used in CPN, we apply a Gaussian filter to the predicted heatmaps. We also predict the pose of the corresponding flipped image and average the heatmaps to obtain the final prediction. A quarter offset in the direction from the highest response to the second-highest response is used to obtain the final location of each keypoint. In order to obtain the best performance on the MAP metric, we first apply SoftNMS to the candidate human bounding boxes generated by the Deformable Convolutional Networks and SNIPER. Second, we use the Pose-OKS method with a threshold of 0.4 to filter out redundant human keypoints. Finally, we filter out the human bounding boxes whose area is smaller than 3600. In order to achieve the best performance on the MOTA metric, two more rules are added: the score of a human bounding box must be higher than 0.35, and the score of a predicted keypoint must be higher than 0.85.
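The quarter-offset step can be illustrated as follows. This is a minimal sketch under stated assumptions, not our released code: the heatmap is a plain list of rows, the offset is applied along the unit vector from the highest to the second-highest response, and the function name `decode_keypoint` is hypothetical.

```python
import math


def decode_keypoint(heatmap, delta=0.25):
    """Return (x, y, score) for one keypoint heatmap.

    heatmap: list of rows, each a list of floats (at least two cells).
    The peak location is shifted by `delta` pixels toward the
    second-highest response.
    """
    cells = [(score, x, y)
             for y, row in enumerate(heatmap)
             for x, score in enumerate(row)]
    if len(cells) < 2:
        raise ValueError("heatmap needs at least two cells")
    # Sort by response only, so ties keep their row-major order.
    cells.sort(key=lambda c: c[0], reverse=True)
    (best_score, x1, y1), (_, x2, y2) = cells[0], cells[1]
    dx, dy = x2 - x1, y2 - y1
    norm = math.hypot(dx, dy)  # nonzero: the two cells are distinct
    return x1 + delta * dx / norm, y1 + delta * dy / norm, best_score
```

The resulting coordinates are in heatmap pixels and still have to be mapped back to the original image through the crop of the corresponding human bounding box.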

4.4 PoseTrack Challenge Results

We evaluate our method on the whole validation set and part of the test set of the PoseTrack challenge 2018 dataset. The performance on the MAP metric is shown in Table 1, and the performance on the MOTA metric is shown in Table 2. We also show some sample keypoint detection results of our model on the PoseTrack challenge 2018 dataset in Fig. 1.

Fig. 1. Some results of our model on the PoseTrack challenge 2018 dataset.

5 Conclusions

In this paper, we propose an efficient and powerful method for multi-person pose estimation and tracking. For multi-person pose estimation, based on the Cascaded Pyramid Network, we design a shuffle unit to fuse the pyramid feature maps and an attention unit to extract more representative feature maps. For multi-person pose tracking, we simplify the problem to bipartite matching between a pair of frames. Experimental results show that our method achieves 80.9% on the validation set and 77.1% on the test set under the Mean Average Precision (MAP) metric, and 64.0% on the validation set and 57.4% on the test set under the Multi-Object Tracking Accuracy (MOTA) metric.