Towards Streaming Perception

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12347)


Embodied perception refers to the ability of an autonomous agent to perceive its environment so that it can (re)act. The responsiveness of the agent is largely governed by the latency of its processing pipeline. While past work has studied the algorithmic trade-off between latency and accuracy, there has not been a clear metric to compare different methods along the Pareto optimal latency-accuracy curve. We point out a discrepancy between standard offline evaluation and real-time applications: by the time an algorithm finishes processing a particular image frame, the surrounding world has changed. To this end, we present an approach that coherently integrates latency and accuracy into a single metric for real-time online perception, which we refer to as "streaming accuracy". The key insight behind this metric is to jointly evaluate the output of the entire perception stack at every time instant, forcing the stack to consider the amount of streaming data that should be ignored while computation is occurring. More broadly, building upon this metric, we introduce a meta-benchmark that systematically converts any image understanding task into a streaming perception task. We focus on the illustrative tasks of object detection and instance segmentation in urban video streams, and contribute a novel dataset with high-quality and temporally-dense annotations. Our proposed solutions and their empirical analysis demonstrate a number of surprising conclusions: (1) there exists an optimal "sweet spot" that maximizes streaming accuracy along the Pareto optimal latency-accuracy curve, (2) asynchronous tracking and future forecasting naturally emerge as internal representations that enable streaming image understanding, and (3) dynamic scheduling can be used to overcome temporal aliasing, yielding the paradoxical result that latency is sometimes minimized by sitting idle and "doing nothing".
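The core of the streaming-accuracy idea is that at every ground-truth timestamp, the evaluator scores whichever prediction the perception stack had *finished* producing by that instant, not the prediction for that exact frame. A minimal sketch of this matching step is below; the function name, argument names, and the list-based representation are illustrative assumptions, not the paper's actual benchmark code.

```python
def streaming_match(pred_done_times, query_times):
    """For each query (ground-truth) timestamp, return the index of the
    most recent prediction whose computation had finished by that instant,
    or None if no prediction was available yet.

    pred_done_times: completion timestamps (seconds), sorted ascending.
    query_times: ground-truth evaluation timestamps (seconds).
    """
    matches = []
    for t in query_times:
        idx = None
        for i, done_at in enumerate(pred_done_times):
            if done_at <= t:
                idx = i  # latest output already available at time t
            else:
                break  # later predictions finished after t
        matches.append(idx)
    return matches


# Example: a detector with ~100 ms latency on a 30 FPS stream.
# Predictions finish at 0.10 s, 0.20 s, 0.30 s; ground truth is
# queried at 0.00 s, 0.15 s, 0.25 s.
print(streaming_match([0.10, 0.20, 0.30], [0.00, 0.15, 0.25]))
```

Once each query timestamp is paired with an (often stale) prediction in this way, any standard accuracy metric such as AP can be computed over the pairs, which is what makes the benchmark a systematic conversion of offline tasks into streaming ones.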



This work was supported by the CMU Argo AI Center for Autonomous Vehicle Research and by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001117C0051. Annotations for Argoverse-HD were provided by Scale AI.

Supplementary material

Supplementary material 1: 504434_1_En_28_MOESM1_ESM.pdf (PDF, 987 KB)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. CMU, Pittsburgh, USA
  2. UIUC, Urbana, USA
  3. Argo AI, Pittsburgh, USA
