Video Object Segmentation with Language Referring Expressions

  • Conference paper
  • Conference: Computer Vision – ACCV 2018 (ACCV 2018)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 11364)

Abstract

Most state-of-the-art semi-supervised video object segmentation methods rely on a pixel-accurate mask of the target object provided for the first frame of a video. However, obtaining a detailed segmentation mask is expensive and time-consuming. In this work we explore an alternative way of identifying a target object, namely by employing language referring expressions. Besides being a more practical and natural way of pointing out a target object, using language specifications can help to avoid drift and make the system more robust to complex dynamics and appearance variations. Leveraging recent advances in language grounding models designed for images, we propose an approach that extends them to video data, ensuring temporally coherent predictions. To evaluate our approach we augment the popular video object segmentation benchmarks DAVIS16 and DAVIS17 with language descriptions of target objects. We show that our language-supervised approach performs on par with methods that have access to a pixel-level mask of the target object on DAVIS16, and is competitive with methods using scribbles on the challenging DAVIS17 dataset.
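The abstract describes extending per-image language grounding to video while keeping predictions temporally coherent. A minimal sketch of that idea, assuming toy per-frame mask proposals with grounding scores (all names, data, and the scoring scheme here are illustrative, not the authors' actual implementation): each frame's proposal is selected not only by its language-grounding score but also by its overlap with the mask chosen in the previous frame.

```python
def iou(a, b):
    """Intersection-over-union of two pixel-index sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_masks(proposals_per_frame, temporal_weight=0.5):
    """For each frame, pick the proposal maximizing its language-grounding
    score plus a bonus for overlap with the previous frame's selection,
    encouraging temporally coherent predictions across the video."""
    selected = []
    prev = None
    for proposals in proposals_per_frame:
        def total(p):
            score, mask = p
            bonus = temporal_weight * iou(mask, prev) if prev is not None else 0.0
            return score + bonus

        _, best_mask = max(proposals, key=total)
        selected.append(best_mask)
        prev = best_mask
    return selected


# Toy video with two frames; each proposal is (grounding_score, pixel_set).
# In frame 2 the distractor scores slightly higher on language alone, but the
# temporal-overlap bonus keeps the selection on the same object.
frames = [
    [(0.9, {1, 2, 3}), (0.2, {10, 11})],
    [(0.55, {1, 2, 4}), (0.6, {10, 11, 12})],
]
print(select_masks(frames))  # → [{1, 2, 3}, {1, 2, 4}]
```

With `temporal_weight=0` the second frame would switch to the distractor mask `{10, 11, 12}`; the overlap bonus is what prevents the drift the abstract mentions.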



Author information

Correspondence to Anna Khoreva.


Electronic supplementary material

Supplementary material 1 (PDF 7468 KB)


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Khoreva, A., Rohrbach, A., Schiele, B. (2019). Video Object Segmentation with Language Referring Expressions. In: Jawahar, C., Li, H., Mori, G., Schindler, K. (eds) Computer Vision – ACCV 2018. ACCV 2018. Lecture Notes in Computer Science, vol 11364. Springer, Cham. https://doi.org/10.1007/978-3-030-20870-7_8


  • DOI: https://doi.org/10.1007/978-3-030-20870-7_8


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-20869-1

  • Online ISBN: 978-3-030-20870-7

  • eBook Packages: Computer Science; Computer Science (R0)
