
URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12360)

Abstract

We propose a unified referring video object segmentation network (URVOS). URVOS takes a video and a referring expression as inputs and estimates the masks of the object referred to by the language expression across all video frames. Our algorithm addresses this challenging problem by performing language-based object segmentation and mask propagation jointly within a single deep neural network, using an appropriate combination of two attention models. In addition, we construct the first large-scale referring video object segmentation dataset, called Refer-Youtube-VOS. We evaluate our model on two benchmark datasets, including ours, and demonstrate the effectiveness of the proposed approach. The dataset is released at https://github.com/skynbe/Refer-Youtube-VOS.
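The core idea of grounding a language expression in visual features can be illustrated with generic scaled dot-product attention, where a language query vector attends over per-location visual features. The sketch below is a minimal, self-contained illustration of this attention mechanism, not the paper's exact modules; the function and variable names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(query, keys, values):
    """Scaled dot-product attention: one query vector attends over
    key/value pairs. A generic sketch of cross-modal attention, in
    which a language feature (query) is grounded in visual features
    (keys/values); the paper's actual attention modules differ."""
    d = len(query)
    # Similarity between the query and each key, scaled by sqrt(d)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Weighted sum of value vectors, one output per value dimension
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# Toy example: a language query attends over three visual locations.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0]]
out = cross_attention(query, keys, values)
```

The same mechanism, with the previous frame's mask features as keys and values, underlies memory-based mask propagation; URVOS combines both forms of attention in one network.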

Keywords

Video object segmentation · Referring object segmentation

Notes

Acknowledgement

This work was supported by Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) [2017-0-01779, 2017-0-01780].

Supplementary material

Supplementary material 1: 504470_1_En_13_MOESM1_ESM.zip (37.2 MB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Seoul National University, Seoul, South Korea
  2. Adobe Research, San Jose, USA
