Abstract
Multi-person pose estimation is a fundamental yet challenging task in computer vision, and recent progress in pose estimation has increased interest in pose tracking. In this work, we propose an efficient and powerful method to locate and track human poses. Our method builds upon a state-of-the-art single-person pose estimation system, the Cascaded Pyramid Network, and adopts an IoU-based tracker module to identify people in the wild. We conduct experiments on the released multi-person video pose estimation benchmark, PoseTrack 2018, to validate the effectiveness of our network. Our model achieves 80.9% on the validation set and 77.1% on the test set under the Mean Average Precision (mAP) metric, and 64.0% on the validation set and 57.4% on the test set under the Multi-Object Tracking Accuracy (MOTA) metric.
Keywords
- Pose estimation
- Pose tracking
D. Yu and K. Su—Equal contribution.
1 Introduction
Multi-person pose estimation and tracking aim to localize the body keypoints of all persons in RGB images and associate them over time. They are important yet challenging problems and fundamental research topics for many visual applications such as human action recognition [19] and human-computer interaction [6].
Recently, the performance of multi-person pose estimation on standard benchmarks such as MPII Pose [11] and COCO [12] has been greatly improved with the rapid development of convolutional neural networks [2, 4, 8, 13, 14, 15, 16, 18, 20]. Existing methods can be classified into two kinds of approaches: bottom-up and top-down. Bottom-up approaches first detect all candidate keypoints in an image and then assemble them into individual persons. Top-down approaches first adopt a detection module to obtain all the human boxes in the image and then apply a single-person pose estimator to each box to detect the human skeleton. Although impressive performance has been achieved, current state-of-the-art methods still have difficulty dealing with occluded keypoints, invisible keypoints, and crowded backgrounds, which cannot be well localized. Most recent pose tracking methods track each human box over the entire video based on the similarity between pairs of boxes in adjacent frames, measured with box IoU, or on the similarity between pairs of human keypoints, measured with the OKS distance [7, 9, 21].
In this work, we propose an efficient and powerful approach to multi-person keypoint detection and tracking in videos. For the keypoint detection stage, we propose an enhanced cascaded pyramid network to accurately locate human keypoints in each frame of a video. For the keypoint tracking stage, we employ an IoU tracker, a lightweight frame-by-frame optimization method, which allows our model to scale to videos of virtually any length.
2 Related Work
Our proposed approach is related to previous work on human pose estimation and tracking, as described below:
Multi-person pose estimation is an important task in computer vision. Existing approaches can be divided into two categories: bottom-up approaches and top-down approaches. Bottom-up approaches first predict all keypoints and then assemble them into multiple persons. For example, associative embedding simultaneously predicts heatmaps and tag maps to group the predicted keypoints into different persons [13]. Top-down approaches first detect all human boxes in an image and then predict the keypoints within each box independently. For example, the Cascaded Pyramid Network (CPN) predicts human bounding boxes first and then solves single-person pose estimation on the cropped person patches [4]. In general, top-down approaches are more accurate than bottom-up approaches; however, as the number of humans in an image increases, top-down approaches become slower.
Based on the multi-person pose estimation architectures described above, it is natural to extend them from still images to video. Some online trackers simplify the tracking problem to a maximum-weight bipartite matching problem and solve it with a greedy or Hungarian algorithm, where the nodes of the bipartite graph are human bounding boxes in two adjacent frames. For example, PoseTrack [7] and ArtTrack [9] in CVPR'17 first introduced the multi-person pose tracking challenge and proposed a new graph partitioning formulation, building upon the 2D DeeperCut [10] by extending the spatial joint graph to a spatio-temporal graph.
3 Method
In this work, we take the top-down approach to estimate multi-person poses in each frame. First, we apply a human detector to the RGB image to generate human bounding boxes. Second, we predict the detailed locations of the keypoints within each candidate bounding box using a single-person pose estimator. Finally, we simplify the tracking problem to bipartite matching of the candidate bounding boxes between pairs of adjacent frames, as sketched below.
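To make the three-stage structure concrete, here is a minimal sketch of the pipeline in Python. The helpers `detect_humans`, `estimate_pose`, and `match_by_iou` are hypothetical placeholders standing in for the detector (Sect. 3.1), the single-person pose estimator (Sect. 3.2), and the bipartite matcher (Sect. 3.3), not the authors' actual implementation.

```python
def track_video(frames, detect_humans, estimate_pose, match_by_iou):
    """Run the detect -> estimate -> track pipeline over a list of frames."""
    tracks = []        # per-frame list of (track_id, box, keypoints)
    prev_boxes = []    # boxes from the previous frame
    prev_ids = []      # track ids assigned in the previous frame
    next_id = 0
    for frame in frames:
        boxes = detect_humans(frame)                       # stage 1: detection
        poses = [estimate_pose(frame, box) for box in boxes]  # stage 2: pose
        # stage 3: for each current box, index of the matched previous box or None
        matches = match_by_iou(prev_boxes, boxes)
        ids = []
        for m in matches:
            if m is None:              # unmatched box starts a new track
                ids.append(next_id)
                next_id += 1
            else:                      # propagate the label forward
                ids.append(prev_ids[m])
        tracks.append(list(zip(ids, boxes, poses)))
        prev_boxes, prev_ids = boxes, ids
    return tracks
```

Because matching is done frame by frame, memory use is independent of video length, which is what makes the method scale to long videos.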
3.1 Person Detector
In order to detect as many people as possible in each image, we adopt the Deformable Convolutional Networks (with a detection mAP of 44.4 on the COCO minival dataset) [5] and SNIPER (with a detection mAP of 46.5 on the COCO minival dataset) [17] to generate our human bounding boxes.
3.2 Pose Estimator
In order to obtain accurate person keypoints, we adopt the state-of-the-art single-person pose estimator, the Cascaded Pyramid Network [4], to detect human skeletons. In addition, we enhance the Cascaded Pyramid Network to make it more robust and accurate in handling large pose variations, changes in clothing and lighting conditions, severe body deformations, heavy body occlusions, and so on. For the GlobalNet, we design a shuffle unit to exchange information across all feature scales. For the RefineNet, we design an attention unit to extract more representative features for predicting keypoint locations.
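The paper does not specify the internals of the shuffle and attention units, so the following PyTorch sketch is only one plausible reading of the description: a cross-scale fusion module that mixes resized pyramid features, and an SE-style channel attention block. The names `CrossScaleShuffle` and `ChannelAttention` are hypothetical.

```python
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleShuffle(nn.Module):
    """Exchange information across pyramid levels: resize every level to
    every target resolution, sum, and mix with a per-level 1x1 conv."""
    def __init__(self, channels, num_levels):
        super().__init__()
        self.mix = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=1) for _ in range(num_levels))

    def forward(self, feats):  # feats: list of (N, C, H_i, W_i) tensors
        out = []
        for i, f in enumerate(feats):
            fused = sum(F.interpolate(g, size=f.shape[-2:], mode="bilinear",
                                      align_corners=False) for g in feats)
            out.append(self.mix[i](fused))
        return out

class ChannelAttention(nn.Module):
    """SE-style channel reweighting, used here as a stand-in for the
    paper's attention unit in the RefineNet."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))            # globally pooled (N, C) weights
        return x * w.unsqueeze(-1).unsqueeze(-1)   # reweight feature channels
```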
3.3 Pose Tracker
Following the ICCV 2017 PoseTrack winner [7], the detections are represented as a graph in which every detected person bounding box in every frame is a node, and edges connect each bounding box in one frame to each bounding box in the next frame. The cost of each edge is defined by the IoU between the two bounding boxes it links, which measures how likely they are to belong to the same person. To compute tracks, we simplify the problem to bipartite matching between each pair of adjacent frames and propagate the track labels forward one frame at a time, from the first frame to the last.
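A minimal sketch of this matching step, using the Hungarian algorithm via SciPy; the 0.3 IoU gate is an illustrative value, not a threshold reported in the paper:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_by_iou(prev_boxes, curr_boxes, iou_threshold=0.3):
    """Bipartite matching between adjacent frames; returns, for each
    current box, the index of the matched previous box or None."""
    if not prev_boxes or not curr_boxes:
        return [None] * len(curr_boxes)
    # edge cost = 1 - IoU, so minimizing cost maximizes total overlap
    cost = np.array([[1.0 - iou(p, c) for c in curr_boxes] for p in prev_boxes])
    rows, cols = linear_sum_assignment(cost)   # Hungarian algorithm
    assignment = [None] * len(curr_boxes)
    for r, c in zip(rows, cols):
        if 1.0 - cost[r, c] >= iou_threshold:  # reject low-overlap matches
            assignment[c] = r
    return assignment
```

Boxes left unmatched start new tracks, so identities that leave the frame are dropped and re-entering people receive fresh labels.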
4 Experiments
4.1 Dataset and Evaluation Metric
Our single-person pose estimation model is trained on three datasets: the MSCOCO dataset [12], the AI Challenger dataset [3], and the PoseTrack 2018 dataset [1]. The MSCOCO dataset contains over 66k images with 150k people, the AI Challenger dataset has more than 270k images with 449k people, and the PoseTrack 2018 dataset contains 667 short video clips annotated for multi-person pose estimation and multi-person pose tracking.
We evaluate our proposed method on the PoseTrack 2018 dataset. We use total AP to evaluate the multi-person pose estimation results and the standard MOTA metric to evaluate the tracking performance.
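For reference, the standard MOTA metric aggregates the per-frame false negatives, false positives, and identity switches over the whole sequence:

$$\mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t\right)}{\sum_t \mathrm{GT}_t}$$

where, at frame $t$, $\mathrm{FN}_t$ is the number of missed targets, $\mathrm{FP}_t$ the number of false detections, $\mathrm{IDSW}_t$ the number of identity switches, and $\mathrm{GT}_t$ the number of ground-truth annotations.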
4.2 Training Details
Our single-person pose estimation model is trained using the Adam optimizer with an initial learning rate of 5e-4; the learning rate is decreased by a factor of 2 every 3,600,000 iterations. We use a weight decay of 1e-5 and a training batch size of 32. Training is performed on 4 V100 GPUs on a single GPU server.
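A short sketch of this optimization setup; the framework choice (PyTorch) is an assumption, and the one-layer model below is only a placeholder for the enhanced CPN:

```python
import torch

model = torch.nn.Conv2d(3, 17, kernel_size=1)  # placeholder for the enhanced CPN
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=1e-5)
# halve the learning rate every 3,600,000 iterations (scheduler stepped per iteration)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3_600_000, gamma=0.5)
```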
4.3 Testing Details
Following the same testing strategies used in CPN, we apply a Gaussian filter to the predicted heatmaps. We also predict the pose of the corresponding flipped image and average the heatmaps to obtain the final prediction. A quarter offset in the direction from the highest response to the second highest response is used to obtain the final keypoint location. To obtain the best performance under the mAP metric, we first apply Soft-NMS to the candidate human bounding boxes generated by the Deformable Convolutional Networks and SNIPER. Second, we use pose-OKS suppression with a threshold of 0.4 to filter out redundant human keypoint predictions. Finally, we discard human bounding boxes whose area is smaller than 3600 pixels. To achieve the best performance under the MOTA metric, two more rules are applied: the score of a human bounding box must be higher than 0.35, and the score of a predicted keypoint must be higher than 0.85.
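A minimal sketch of the heatmap decoding steps described above for a single keypoint; `decode_heatmap` is a hypothetical helper, and it assumes the flipped-image heatmap has already been flipped back to the original orientation before averaging:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def decode_heatmap(heatmap, flipped_heatmap=None, sigma=1.0):
    """Return (x, y, score) for one keypoint: average with the flipped
    prediction, smooth with a Gaussian filter, then shift the peak a
    quarter pixel toward the second highest response."""
    if flipped_heatmap is not None:
        heatmap = (heatmap + flipped_heatmap) / 2.0   # flip averaging
    hm = gaussian_filter(heatmap, sigma=sigma)        # Gaussian smoothing
    order = hm.flatten().argsort()
    y1, x1 = np.unravel_index(order[-1], hm.shape)    # highest response
    y2, x2 = np.unravel_index(order[-2], hm.shape)    # second highest
    direction = np.sign(np.array([x2 - x1, y2 - y1], dtype=np.float64))
    x, y = np.array([x1, y1], dtype=np.float64) + 0.25 * direction
    return x, y, hm[y1, x1]
```

The quarter-pixel shift compensates for the quantization introduced by the low-resolution heatmap grid.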
4.4 PoseTrack Challenge Results
We evaluate our method on the full validation set and part of the test set of the PoseTrack 2018 dataset. The performance under the mAP metric is shown in Table 1, and the performance under the MOTA metric is shown in Table 2. We also show some sample keypoint detection results of our model on the PoseTrack 2018 dataset in Fig. 1.
5 Conclusions
In this paper, we propose an efficient and powerful method for multi-person pose estimation and tracking. For multi-person pose estimation, building on the Cascaded Pyramid Network, we design a shuffle unit to fuse the pyramid feature maps and an attention unit to extract more representative feature maps. For multi-person pose tracking, we simplify the problem to bipartite matching between pairs of adjacent frames. Experimental results show that our method achieves 80.9% on the validation set and 77.1% on the test set under the Mean Average Precision (mAP) metric, and 64.0% on the validation set and 57.4% on the test set under the Multi-Object Tracking Accuracy (MOTA) metric.
References
1. PoseTrack 2018 Challenge: PoseTrack challenge 2018 dataset. https://posetrack.net/
2. Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2D pose estimation using part affinity fields. In: CVPR (2017)
3. AI Challenger: AI Challenger dataset. https://challenger.ai/
4. Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J.: Cascaded pyramid network for multi-person pose estimation. arXiv preprint arXiv:1711.07319 (2017)
5. Dai, J., et al.: Deformable convolutional networks. arXiv preprint arXiv:1703.06211 (2017)
6. Dix, A.: Human-computer interaction. In: Liu, L., Özsu, M.T. (eds.) Encyclopedia of Database Systems, pp. 1327–1331. Springer, Boston (2009). https://doi.org/10.1007/978-0-387-39940-9_192
7. Girdhar, R., Gkioxari, G., Torresani, L., Paluri, M., Tran, D.: Detect-and-track: efficient pose estimation in videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 350–359 (2018)
8. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988. IEEE (2017)
9. Insafutdinov, E., et al.: ArtTrack: articulated multi-person tracking in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
10. Insafutdinov, E., Pishchulin, L., Andres, B., Andriluka, M., Schiele, B.: DeeperCut: a deeper, stronger, and faster multi-person pose estimation model. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9910, pp. 34–50. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46466-4_3
11. MPII: MPII human pose dataset. http://human-pose.mpi-inf.mpg.de/
12. MS-COCO: COCO keypoint leaderboard. http://cocodataset.org/
13. Newell, A., Huang, Z., Deng, J.: Associative embedding: end-to-end learning for joint detection and grouping. In: Advances in Neural Information Processing Systems, pp. 2274–2284 (2017)
14. Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 483–499. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_29
15. Papandreou, G., et al.: Towards accurate multi-person pose estimation in the wild. In: CVPR (2017)
16. Pishchulin, L., et al.: DeepCut: joint subset partition and labeling for multi person pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4929–4937 (2016)
17. Singh, B., Najibi, M., Davis, L.S.: SNIPER: efficient multi-scale training. arXiv preprint arXiv:1805.09300 (2018)
18. Toshev, A., Szegedy, C.: DeepPose: human pose estimation via deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1653–1660 (2014)
19. Wang, C., Wang, Y., Yuille, A.L.: An approach to pose-based action recognition. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 915–922. IEEE (2013)
20. Wei, S.E., Ramakrishna, V., Kanade, T., Sheikh, Y.: Convolutional pose machines. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4724–4732 (2016)
21. Xiu, Y., Li, J., Wang, H., Fang, Y., Lu, C.: Pose Flow: efficient online pose tracking. arXiv preprint arXiv:1802.00977 (2018)