
Segment as Points for Efficient Online Multi-Object Tracking and Segmentation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12346)

Abstract

Current multi-object tracking and segmentation (MOTS) methods follow the tracking-by-detection paradigm and rely on convolutions for feature extraction. However, because a convolutional receptive field inevitably extends beyond an instance mask, convolution-based feature extraction mixes foreground and background features, introducing ambiguities into the subsequent instance association. In this paper, we propose a highly effective method for learning instance embeddings from segments by converting the compact image representation into an unordered 2D point cloud representation. Our method establishes a new tracking-by-points paradigm in which discriminative instance embeddings are learned from randomly selected points rather than from images. Furthermore, multiple informative data modalities are converted into point-wise representations to enrich point-wise features. The resulting online MOTS framework, named PointTrack, surpasses all state-of-the-art methods, including 3D tracking methods, by large margins (5.4% higher MOTSA than MOTSFusion while running 18 times faster) at near real-time speed (22 FPS). Evaluations on three datasets demonstrate both the effectiveness and the efficiency of our method. Moreover, observing that current MOTS datasets lack crowded scenes, we build a more challenging MOTS dataset named APOLLO MOTS with a higher instance density. Both APOLLO MOTS and our code are publicly available at https://github.com/detectRecog/PointTrack.
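The tracking-by-points idea described above can be made concrete with a small sketch. The snippet below is an illustrative PyTorch approximation, not the authors' released implementation: it samples random foreground pixels of a segment as a 2D point cloud, attaches simple point-wise features (offsets from the segment center plus RGB color; the paper additionally exploits surrounding environment points and further modalities), and pools them with a PointNet-style encoder [28] into an order-invariant instance embedding. The names `sample_points` and `PointEmbedNet` are hypothetical.

```python
# A minimal sketch of tracking-by-points embedding extraction (an
# assumption-laden approximation, not PointTrack's exact architecture).
import torch
import torch.nn as nn

def sample_points(mask, image, num_points=250):
    """Randomly sample foreground pixels of a segment as a 2D point cloud.

    mask:  (H, W) boolean instance mask (assumed non-empty)
    image: (3, H, W) RGB image
    Returns (num_points, 5) point features: normalized (x, y) offsets
    from the segment center plus the RGB color at each sampled point.
    """
    ys, xs = torch.nonzero(mask, as_tuple=True)
    idx = torch.randint(len(xs), (num_points,))        # sample with replacement
    xs, ys = xs[idx].float(), ys[idx].float()
    cx, cy = xs.mean(), ys.mean()                      # segment center
    offsets = torch.stack([xs - cx, ys - cy], dim=1) / mask.shape[1]
    colors = image[:, ys.long(), xs.long()].t()        # (num_points, 3)
    return torch.cat([offsets, colors], dim=1)

class PointEmbedNet(nn.Module):
    """PointNet-style encoder: per-point MLP + max pooling -> embedding.

    Max pooling over the point dimension makes the embedding invariant
    to the (arbitrary) order of the sampled points.
    """
    def __init__(self, in_dim=5, embed_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, points):                         # (N, P, in_dim)
        per_point = self.mlp(points)                   # (N, P, embed_dim)
        return per_point.max(dim=1).values             # (N, embed_dim)
```

At tracking time, embeddings computed this way for segments in adjacent frames would be compared (e.g., by cosine or Euclidean distance) and matched, for instance with the Hungarian algorithm [17], to link segments into tracks.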

Keywords

Motion and tracking · Tracking · Vision for robotics

Notes

Acknowledgement

This work was supported by the Anhui Initiative in Quantum Information Technologies (No. AHY150300).

Supplementary material

Supplementary material 1: 500725_1_En_16_MOESM1_ESM.zip (70.7 MB)

References

1. Baser, E., Balasubramanian, V., Bhattacharyya, P., Czarnecki, K.: FANTrack: 3D multi-object tracking with feature association network. In: 2019 IEEE Intelligent Vehicles Symposium (IV), pp. 1426–1433. IEEE (2019)
2. Bergmann, P., Meinhardt, T., Leal-Taixé, L.: Tracking without bells and whistles. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 941–951 (2019)
3. Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: 2016 IEEE International Conference on Image Processing (ICIP), pp. 3464–3468. IEEE (2016)
4. Bhat, G., Danelljan, M., Gool, L.V., Timofte, R.: Learning discriminative model prediction for tracking. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 6182–6191 (2019)
5. Chen, L., Ai, H., Zhuang, Z., Shang, C.: Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In: 2018 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE (2018)
6. Chu, P., Ling, H.: FAMNet: joint learning of feature, affinity and multi-dimensional assignment for online multiple object tracking. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 6172–6181 (2019)
7. Danelljan, M., Bhat, G., Khan, F.S., Felsberg, M.: ATOM: accurate tracking by overlap maximization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4660–4669 (2019)
8. Geiger, A., Lauer, M., Wojek, C., Stiller, C., Urtasun, R.: 3D traffic scene understanding from movable platforms. IEEE Trans. Pattern Anal. Mach. Intell. 36(5), 1012–1025 (2013)
9. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2012)
10. Held, D., Levinson, J., Thrun, S.: Precision tracking with sparse 3D and dense color 2D data. In: 2013 IEEE International Conference on Robotics and Automation, pp. 1138–1145. IEEE (2013)
11. Henschel, R., Leal-Taixé, L., Cremers, D., Rosenhahn, B.: Fusion of head and full-body detectors for multi-object tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1428–1437 (2018)
12. Hu, A., Kendall, A., Cipolla, R.: Learning a spatio-temporal embedding for video instance segmentation. arXiv preprint arXiv:1912.08969 (2019)
13. Huang, X., et al.: The ApolloScape dataset for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 954–960 (2018)
14. Karunasekera, H., Wang, H., Zhang, H.: Multiple object tracking with attention to appearance, structure, motion and size. IEEE Access 7, 104423–104434 (2019)
15. Keuper, M., Tang, S., Andres, B., Brox, T., Schiele, B.: Motion segmentation & multiple object tracking by correlation co-clustering. IEEE Trans. Pattern Anal. Mach. Intell. 42(1), 140–153 (2018)
16. Kim, C., Li, F., Ciptadi, A., Rehg, J.M.: Multiple hypothesis tracking revisited. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4696–4704 (2015)
17. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Res. Logistics Q. 2(1–2), 83–97 (1955)
18. Luiten, J., Fischer, T., Leibe, B.: Track to reconstruct and reconstruct to track. IEEE Robot. Autom. Lett. 5(2), 1803–1810 (2020)
19. Luiten, J., Voigtlaender, P., Leibe, B.: PReMVOS: proposal-generation, refinement and merging for video object segmentation. In: Jawahar, C.V., Li, H., Mori, G., Schindler, K. (eds.) ACCV 2018. LNCS, vol. 11364, pp. 565–580. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20870-7_35
20. Luo, W., Yang, B., Urtasun, R.: Fast and furious: real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3569–3577 (2018)
21. Milan, A., Leal-Taixé, L., Reid, I., Roth, S., Schindler, K.: MOT16: a benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831 (2016)
22. Mitzel, D., Leibe, B.: Taking mobile multi-object tracking to the next level: people, unknown objects, and carried items. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7576, pp. 566–579. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33715-4_41
23. Neven, D., Brabandere, B.D., Proesmans, M., Gool, L.V.: Instance segmentation by jointly optimizing spatial embeddings and clustering bandwidth. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019
24. Osep, A., Mehner, W., Mathias, M., Leibe, B.: Combined image- and world-space tracking in traffic scenes. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1988–1995. IEEE (2017)
25. Ošep, A., Mehner, W., Voigtlaender, P., Leibe, B.: Track, then decide: category-agnostic vision-based multi-object tracking. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8. IEEE (2018)
26. Payer, C., Štern, D., Neff, T., Bischof, H., Urschler, M.: Instance segmentation and tracking with cosine embeddings and recurrent hourglass networks. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11071, pp. 3–11. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00934-2_1
27. Porzi, L., Hofinger, M., Ruiz, I., Serrat, J., Bulò, S.R., Kontschieder, P.: Learning multi-object tracking and segmentation from automatic annotations. arXiv preprint arXiv:1912.02096 (2019)
28. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: deep learning on point sets for 3D classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660 (2017)
29. Qi, L., Jiang, L., Liu, S., Shen, X., Jia, J.: Amodal instance segmentation with KINS dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3014–3023 (2019)
30. Ren, J., et al.: Accurate single stage detector using recurrent rolling convolution. In: CVPR (2017)
31. Sharma, S., Ansari, J.A., Murthy, J.K., Krishna, K.M.: Beyond pixels: leveraging geometry and shape cues for online multi-object tracking. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 3508–3515. IEEE (2018)
32. Tian, W., Lauer, M., Chen, L.: Online multi-object tracking using joint domain information in traffic scenarios. IEEE Trans. Intell. Transp. Syst. 21, 374–384 (2019)
33. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
34. Voigtlaender, P., et al.: MOTS: multi-object tracking and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7942–7951 (2019)
35. Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. In: 2017 IEEE International Conference on Image Processing (ICIP), pp. 3645–3649. IEEE (2017)
36. Xu, J., Cao, Y., Zhang, Z., Hu, H.: Spatial-temporal relation networks for multi-object tracking. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3988–3998 (2019)
37. Xu, Z., et al.: Towards end-to-end license plate detection and recognition: a large dataset and baseline. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 255–271 (2018)
38. Xu, Z., et al.: ZoomNet: part-aware adaptive zooming neural network for 3D object detection. In: AAAI, pp. 12557–12564 (2020)
39. Yang, G., Ramanan, D.: Volumetric correspondence networks for optical flow. In: Advances in Neural Information Processing Systems, pp. 793–803 (2019)
40. Yuan, Y., Chen, W., Yang, Y., Wang, Z.: In defense of the triplet loss again: learning robust person re-identification with fast approximated triplet loss and label distillation. arXiv preprint arXiv:1912.07863 (2019)
41. Zhang, W., Zhou, H., Sun, S., Wang, Z., Shi, J., Loy, C.C.: Robust multi-modality multi-object tracking. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2365–2374 (2019)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

1. University of Science and Technology of China, Hefei, China
2. Department of Computer Vision Technology (VIS), Baidu Inc., Beijing, China
