URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12360)


We propose a unified referring video object segmentation network (URVOS). URVOS takes a video and a referring expression as inputs and estimates the masks of the object referred to by the expression across all frames of the video. Our algorithm addresses this challenging problem by performing language-based object segmentation and mask propagation jointly within a single deep neural network, using a suitable combination of two attention models. In addition, we construct the first large-scale referring video object segmentation dataset, called Refer-Youtube-VOS. We evaluate our model on two benchmark datasets, including ours, and demonstrate the effectiveness of the proposed approach. The dataset is released at
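Both attention models mentioned in the abstract (one grounding the referring expression in visual features, one propagating masks across frames) can be understood as instances of scaled dot-product attention. Below is a minimal, illustrative sketch of the cross-modal case, where spatial features attend over word features; the function name, shapes, and NumPy implementation are assumptions for exposition, not the paper's actual code.

```python
import numpy as np

def cross_modal_attention(visual_feats, lang_feats):
    """Scaled dot-product attention: visual positions attend to words.

    visual_feats: (N, d) array of pixel/region features (queries).
    lang_feats:   (L, d) array of word features (keys and values).
    Returns an (N, d) array of language-aware visual features.
    """
    d = visual_feats.shape[-1]
    # Similarity between every spatial position and every word, scaled by sqrt(d).
    scores = visual_feats @ lang_feats.T / np.sqrt(d)          # (N, L)
    # Softmax over the word axis (numerically stabilized).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each position receives a weighted mixture of word features.
    return weights @ lang_feats                                # (N, d)

# Toy example: 4 spatial positions, 3 words, 8-dim features.
rng = np.random.default_rng(0)
out = cross_modal_attention(rng.standard_normal((4, 8)),
                            rng.standard_normal((3, 8)))
print(out.shape)  # (4, 8)
```

The memory-attention counterpart follows the same pattern, with keys and values drawn from features of previously segmented frames instead of word embeddings.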


Keywords: Video object segmentation · Referring object segmentation



This work was supported by Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) [2017-0-01779, 2017-0-01780].

Supplementary material 1 (zip, 37.2 MB)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Seoul National University, Seoul, South Korea
  2. Adobe Research, San Jose, USA
