Abstract
For decades, the problem of multiple people tracking has been tackled using 2D data only. However, people move and interact in a three-dimensional space, so relying solely on 2D data can be limiting and overly challenging, especially in the presence of occlusions and multiple overlapping people. In this paper, we take advantage of 3D synthetic data from the novel MOTSynth dataset to train our proposed 3D people detector, whose observations are fed to a tracker that operates in the corresponding 3D space. Compared to conventional 2D trackers, we show an overall improvement in performance, with a reduction of identity switches on both real and synthetic data. Additionally, we propose a tracker that jointly exploits 3D and 2D data, showing an improvement over the proposed baselines. Our experiments demonstrate that 3D data can be beneficial, and we believe this paper will pave the way for future efforts to leverage 3D data for multiple people tracking. The code is available at https://github.com/GianlucaMancusi/LoCO-Det.
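To make the tracking-in-3D idea concrete, the snippet below is a minimal, hypothetical sketch of how 3D detections might be associated to existing tracks by Euclidean distance in a shared metric frame. It is an illustrative example with assumed inputs and a made-up gating threshold, not the detector or tracker described in the paper.

```python
# Illustrative sketch only: a generic 3D detection-to-track association step,
# not the authors' LoCO-Det implementation. Assumes detections and tracks are
# given as 3D positions (x, y, z) in a shared camera/world frame.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_3d(track_positions, det_positions, max_dist=1.0):
    """Match tracks to new 3D detections by Euclidean distance.

    track_positions: (T, 3) array of last known 3D track positions.
    det_positions:   (D, 3) array of current-frame 3D detections.
    max_dist:        gating threshold in meters (hypothetical value).
    Returns a list of (track_idx, det_idx) matches.
    """
    if len(track_positions) == 0 or len(det_positions) == 0:
        return []
    # Pairwise 3D Euclidean distances serve as the association cost.
    cost = np.linalg.norm(
        track_positions[:, None, :] - det_positions[None, :, :], axis=-1
    )
    rows, cols = linear_sum_assignment(cost)
    # Discard matches farther apart than the gating distance.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
```

Working directly with 3D positions in this way is what lets a tracker disambiguate people who overlap in the image plane but are separated in depth.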
Keywords
- Multiple-object tracking
- Synthetic data
- 3D people detection
Acknowledgements
Partially supported by the PREVUE “PRediction of activities and Events by Vision in an Urban Environment” project (CUP E94I19000650001), PRIN National Research Program, Italian Ministry for Education, University and Research (MIUR), by ROADSTER “Road Sustainable Twins in Emilia Romagna” project, International Foundation Big Data and Artificial Intelligence for Human Development, and by Tetra Pak Packaging Solutions S.P.A.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Mancusi, G. et al. (2022). First Steps Towards 3D Pedestrian Detection and Tracking from Single Image. In: Sclaroff, S., Distante, C., Leo, M., Farinella, G.M., Tombari, F. (eds) Image Analysis and Processing – ICIAP 2022. ICIAP 2022. Lecture Notes in Computer Science, vol 13232. Springer, Cham. https://doi.org/10.1007/978-3-031-06430-2_28
DOI: https://doi.org/10.1007/978-3-031-06430-2_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06429-6
Online ISBN: 978-3-031-06430-2
eBook Packages: Computer Science, Computer Science (R0)