Waymo Open Dataset: Panoramic Video Panoptic Segmentation

  • Conference paper
  • In: Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13689)

Abstract

Panoptic image segmentation is the computer vision task of finding groups of pixels in an image and assigning semantic classes and object instance identifiers to them. Research in image segmentation has become increasingly popular due to its critical applications in robotics and autonomous driving. The research community relies on publicly available benchmark datasets to advance the state of the art in computer vision. Due to the high costs of densely labeling the images, however, there is a shortage of publicly available ground truth labels that are suitable for panoptic segmentation. The high labeling costs also make it challenging to extend existing datasets to the video domain and to multi-camera setups. We therefore present the Waymo Open Dataset: Panoramic Video Panoptic Segmentation, a large-scale dataset that offers high-quality panoptic segmentation labels for autonomous driving. We generate our dataset using the publicly available Waymo Open Dataset (WOD), leveraging the diverse set of camera images. Our labels are consistent over time for video processing and consistent across multiple cameras mounted on the vehicles for full panoramic scene understanding. Specifically, we offer labels for 28 semantic categories and 2,860 temporal sequences that were captured by five cameras mounted on autonomous vehicles driving in three different geographical locations, leading to a total of 100k labeled camera images. To the best of our knowledge, this makes our dataset an order of magnitude larger than existing datasets that offer video panoptic segmentation labels. We further propose a new benchmark for Panoramic Video Panoptic Segmentation and establish a number of strong baselines based on the DeepLab family of models. We have made the benchmark and the code publicly available, which we hope will facilitate future research on holistic scene understanding. Our dataset can be found at: waymo.com/open.
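The task definition above implies a per-pixel label that carries both a semantic class and an instance identifier. A common way to store such labels in DeepLab-style tooling is to pack both into a single integer per pixel as semantic_class * divisor + instance_id. The sketch below illustrates that packing convention in NumPy; the divisor value and function names here are illustrative assumptions, not the dataset's actual API, so consult the released tooling for the real label format.

```python
# Minimal sketch of the common panoptic-label packing convention:
#     panoptic_id = semantic_class * divisor + instance_id
# The divisor (1000) is a hypothetical value chosen for illustration; the
# real value would come from the dataset's metadata.
import numpy as np

DIVISOR = 1000  # hypothetical; large enough to hold all instance ids


def decode_panoptic(panoptic: np.ndarray, divisor: int = DIVISOR):
    """Split a packed panoptic map into semantic and instance maps."""
    semantic = panoptic // divisor  # per-pixel semantic class id
    instance = panoptic % divisor   # per-pixel instance id (0 for "stuff")
    return semantic, instance


def encode_panoptic(semantic: np.ndarray, instance: np.ndarray,
                    divisor: int = DIVISOR) -> np.ndarray:
    """Inverse operation: pack semantic and instance ids into one map."""
    assert instance.max() < divisor, "instance ids must fit below the divisor"
    return semantic * divisor + instance


if __name__ == "__main__":
    # Toy 2x2 map: two pixels of class 5 belonging to instances 1 and 2,
    # plus two unlabeled pixels.
    panoptic = np.array([[5 * DIVISOR + 1, 5 * DIVISOR + 2],
                         [0, 0]])
    sem, inst = decode_panoptic(panoptic)
    print(sem)   # [[5 5] [0 0]]
    print(inst)  # [[1 2] [0 0]]
```

Because the dataset's instance identifiers are consistent over time and across cameras, matching decoded instance ids between frames and camera views is what enables the panoramic video panoptic evaluation the abstract describes.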

J. Mei—Work done as an intern at Waymo.

Author information

Corresponding author: Alex Zihao Zhu.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (mp4 19218 KB)

Supplementary material 2 (pdf 601 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Mei, J. et al. (2022). Waymo Open Dataset: Panoramic Video Panoptic Segmentation. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13689. Springer, Cham. https://doi.org/10.1007/978-3-031-19818-2_4

  • DOI: https://doi.org/10.1007/978-3-031-19818-2_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19817-5

  • Online ISBN: 978-3-031-19818-2

  • eBook Packages: Computer Science, Computer Science (R0)
