Abstract
Despite significant developments in 3D multi-view multi-person (3D MM) tracking, current frameworks target either footprint tracking or pose tracking in isolation. Frameworks designed for the former cannot be applied to the latter, because they obtain 3D positions directly on the ground plane via a homography projection, which is inapplicable to 3D poses above the ground. Conversely, frameworks designed for pose tracking generally treat multi-view and multi-frame association as separate steps and may not be sufficiently robust for footprint tracking, which uses fewer keypoints than pose tracking and thus offers weaker multi-view association cues within a single frame. This study presents a unified multi-view multi-person tracking framework that bridges the gap between footprint tracking and pose tracking. Without additional modification, the framework accepts monocular 2D bounding boxes or 2D poses as input and produces robust 3D trajectories for multiple persons. Importantly, multi-frame and multi-view information is employed jointly to improve both association and triangulation. Our framework achieves state-of-the-art performance on the Campus and Shelf datasets for 3D pose tracking, with comparable results on the WILDTRACK and MMPTRACK datasets for 3D footprint tracking.
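The abstract contrasts two geometric primitives: ground-plane localization via a homography for footprint tracking, and multi-view triangulation for pose keypoints above the ground. As a minimal illustration only (not the paper's implementation; the function names and toy cameras below are hypothetical), both can be sketched in a few lines of NumPy:

```python
import numpy as np

def homography_to_ground(H, uv):
    """Map an image point (e.g. a bounding-box footprint) to ground-plane
    coordinates using a 3x3 image-to-ground homography H."""
    p = H @ np.array([uv[0], uv[1], 1.0])
    return p[:2] / p[2]  # dehomogenize

def triangulate_dlt(Ps, uvs):
    """Linear (DLT) triangulation of one 3D point from two or more views.
    Ps: 3x4 camera projection matrices; uvs: matching 2D observations."""
    rows = []
    for P, (u, v) in zip(Ps, uvs):
        rows.append(u * P[2] - P[0])  # each view contributes two
        rows.append(v * P[2] - P[1])  # linear constraints on X
    _, _, Vt = np.linalg.svd(np.stack(rows))
    X = Vt[-1]                        # null vector of the system
    return X[:3] / X[3]

# Toy setup: identity intrinsics, second camera shifted along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.3, 4.0])
uv1, uv2 = [(P @ np.append(X_true, 1.0)) for P in (P1, P2)]
uv1, uv2 = uv1[:2] / uv1[2], uv2[:2] / uv2[2]
X_hat = triangulate_dlt([P1, P2], [uv1, uv2])  # recovers X_true
```

This makes the abstract's asymmetry concrete: a footprint tracker only needs the homography because its targets lie on the z = 0 plane, whereas keypoints above the ground are not constrained to any plane and require observations from multiple calibrated views.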
![](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs41095-023-0334-8/MediaObjects/41095_2023_334_Fig1_HTML.jpg)
Ethics declarations
The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Fan Yang received his B.S. and Ph.D. degrees in information sciences from Nanjing University, China and Nara Institute of Science and Technology, Japan, in 2012 and 2021, respectively. He is currently a researcher at Fujitsu Research. His research focuses on action recognition, pose estimation, and multi-object tracking. He participates in tracking competitions at CVPR, ICCV, and ECCV, and has obtained three 1st places, two 2nd places, and one 4th place.
Shigeyuki Odashima received his B.E., M.E., and Ph.D. degrees from the University of Tokyo in 2008, 2010, and 2013, respectively. He is currently a research scientist at Fujitsu Research. His research interests include robotics, computer vision, ubiquitous computing, and data mining, including human activity recognition, human pose estimation, and human motion assessment.
Sosuke Yamao received his B.E. degree in information engineering and M.S. degree in information sciences from Tohoku University, Sendai, Japan, in 2013 and 2015, respectively. He is currently a researcher at Fujitsu Research. His research interests include image-based 3D scene modeling, human pose estimation, neural rendering, and machine learning for computer vision.
Hiroaki Fujimoto received his B.S. and M.S. degrees in engineering from Tokyo Metropolitan University in 1997 and 1999, respectively. He is currently a principal researcher at Fujitsu Research. His research interests include pose estimation from depth sensors and RGB camera images.
Shoichi Masui received his B.S. and M.S. degrees from Nagoya University, Japan, in 1982 and 1984, respectively. He received his Ph.D. degree from Tokyo Institute of Technology in 2006. From 1990 to 1992, he was a visiting scholar at Stanford University. In 1999, he joined Fujitsu Limited; from 2000 to 2007, he was with Fujitsu Laboratories Ltd., where he was engaged in various IC design projects. In 2001, he was a visiting scholar at the University of Toronto. From 2007 to 2012, he was a professor in the Research Institute of Electrical Communication of Tohoku University. In 2012 he returned to Fujitsu Laboratories. He is currently engaged in pose estimation from depth sensors and RGB cameras for sports applications at Fujitsu Research. He was the recipient of a commendation from the Japanese Minister of Education, Culture, Sports, Science, and Technology in 2004.
Shan Jiang is a project manager at Fujitsu Research. He received his doctoral degree in control engineering from Shanghai Jiao Tong University. His research interests include robotics, human interaction, 3D reconstruction, and image analysis and synthesis. He has served as Director of the Robotics Society of Japan since 2021.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Yang, F., Odashima, S., Yamao, S. et al. A unified multi-view multi-person tracking framework. Comp. Visual Media 10, 137–160 (2024). https://doi.org/10.1007/s41095-023-0334-8