
Key-Word-Aware Network for Referring Expression Image Segmentation

  • Hengcan Shi
  • Hongliang Li
  • Fanman Meng
  • Qingbo Wu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11210)

Abstract

Referring expression image segmentation aims to segment out the object referred to by a natural language query expression. Without considering the specific properties of visual and textual information, existing works usually handle this task by directly feeding a foreground/background classifier with concatenated image and text features, extracted from each image region and the whole query, respectively. On the one hand, they ignore that each word in a query expression contributes differently to identifying the desired object, which calls for differential treatment when extracting text features. On the other hand, the relationships among different image regions are not considered either, even though they are crucial for eliminating undesired foreground objects in accordance with the specific query. To address these issues, we propose a key-word-aware network, which contains a query attention model and a key-word-aware visual context model. When extracting text features, the query attention model assigns higher weights to the words that are more important for identifying the object. Meanwhile, the key-word-aware visual context model describes the relationships among different image regions according to the corresponding query. Our proposed method outperforms state-of-the-art methods on two referring expression image segmentation databases.
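The query attention idea described above can be sketched as follows: each word feature receives a scalar relevance score, the scores are normalized with a softmax, and the text feature is the resulting weighted sum. This is a minimal NumPy illustration of soft attention pooling over word features, not the authors' exact formulation; the scoring vector `w` stands in for learned network parameters.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def query_attention(word_feats, w):
    """Weight each word by a relevance score, then pool.

    word_feats: (T, d) array of features for the T words in the query.
    w: (d,) hypothetical scoring vector (a stand-in for learned weights).
    Returns the attention weights and the pooled text feature.
    """
    scores = word_feats @ w            # one scalar score per word
    alpha = softmax(scores)            # higher weight = more important word
    text_feat = alpha @ word_feats     # (d,) attention-pooled query feature
    return alpha, text_feat

# Toy example: a 5-word query with 8-dimensional word features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 8))
alpha, pooled = query_attention(feats, rng.normal(size=8))
```

Because the weights are softmax-normalized, they sum to one, so the pooled feature is a convex combination of the word features; words judged unimportant for identifying the object contribute correspondingly little.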

Keywords

Referring expression image segmentation · Key word extraction · Query attention · Key-word-aware visual context

Notes

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China (Nos. 61525102, 61601102 and 61502084).


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Hengcan Shi (1)
  • Hongliang Li (1)
  • Fanman Meng (1)
  • Qingbo Wu (1)
  1. School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China
