Tracking Objects as Points

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12349)


Tracking has traditionally been the art of following interest points through space and time. This changed with the rise of powerful deep networks. Nowadays, tracking is dominated by pipelines that perform object detection followed by temporal association, also known as tracking-by-detection. We present a simultaneous detection and tracking algorithm that is simpler, faster, and more accurate than the state of the art. Our tracker, CenterTrack, applies a detection model to a pair of images and detections from the prior frame. Given this minimal input, CenterTrack localizes objects and predicts their associations with the previous frame. That’s it. CenterTrack is simple, online (no peeking into the future), and real-time. It achieves 67.8% MOTA on the MOT17 challenge at 22 FPS and 89.4% MOTA on the KITTI tracking benchmark at 15 FPS, setting a new state of the art on both datasets. CenterTrack is easily extended to monocular 3D tracking by regressing additional 3D attributes. Using monocular video input, it achieves 28.3% AMOTA@0.2 on the newly released nuScenes 3D tracking benchmark, substantially outperforming the monocular baseline on this benchmark while running at 28 FPS.
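The association step described in the abstract can be sketched in a few lines: each current detection regresses an offset pointing back to its location in the previous frame, and detections are then greedily matched to the nearest unclaimed prior-frame center, with unmatched detections spawning new tracks. The function below is a minimal illustration of this greedy offset-based matching, not the paper's implementation; the gating threshold `gate` and all names are hypothetical.

```python
import math

def greedy_associate(curr_centers, offsets, prev_centers, prev_ids,
                     next_id, gate=32.0):
    """Greedy offset-based association sketch (illustrative only).

    curr_centers: (x, y) object centers detected in the current frame
    offsets:      per-detection displacement back to the previous frame
    prev_centers: (x, y) centers from the previous frame
    prev_ids:     track identities of the previous-frame centers
    next_id:      first unused track identity
    gate:         hypothetical distance threshold for accepting a match
    """
    ids = []
    claimed = set()
    for (cx, cy), (ox, oy) in zip(curr_centers, offsets):
        # project the detection into the previous frame via its offset
        proj = (cx - ox, cy - oy)
        best, best_d = None, float("inf")
        for j, p in enumerate(prev_centers):
            if j in claimed:
                continue
            d = math.dist(proj, p)
            if d < best_d:
                best, best_d = j, d
        if best is not None and best_d < gate:
            claimed.add(best)            # propagate the existing identity
            ids.append(prev_ids[best])
        else:
            ids.append(next_id)          # start a new track
            next_id += 1
    return ids, next_id
```

Greedy matching in descending detection-confidence order (rather than input order, as here) is the variant the tracking-by-detection literature typically uses; the sketch keeps input order for brevity.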


Keywords: Multi-object tracking · Conditioned detection · 3D object tracking



This work has been supported in part by the National Science Foundation under grant IIS-1845485.

Supplementary material

Supplementary material 1 (PDF, 217 KB)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. UT Austin, Austin, USA
  2. Intel Labs, Hillsboro, USA
