Video Object Segmentation with Referring Expressions

  • Anna Khoreva
  • Anna Rohrbach
  • Bernt Schiele
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11132)


Most semi-supervised video object segmentation methods rely on a pixel-accurate mask of the target object provided for the first video frame. However, obtaining a detailed mask is expensive and time-consuming. In this work we explore a more practical and natural way of identifying a target object: language referring expressions. Leveraging recent advances in language grounding models designed for images, we propose an approach that extends them to video data while ensuring temporally coherent predictions. To evaluate our approach we augment the popular video object segmentation benchmarks, \(\text{DAVIS}_{\text{16}}\) and \(\text{DAVIS}_{\text{17}}\), with language descriptions of target objects. We show that our approach performs on par with methods that have access to the object mask on \(\text{DAVIS}_{\text{16}}\) and is competitive with methods using scribbles on the challenging \(\text{DAVIS}_{\text{17}}\).



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Max Planck Institute for Informatics, Saarbrücken, Germany
  2. University of California, Berkeley, USA
