
Describing Textures Using Natural Language

Conference paper in Computer Vision – ECCV 2020 (ECCV 2020)

Abstract

Textures in natural images can be characterized by color, shape, periodicity of elements within them, and other attributes that can be described using natural language. In this paper, we study the problem of describing visual attributes of texture on a novel dataset containing rich descriptions of textures, and conduct a systematic study of current generative and discriminative models for grounding language to images on this dataset. We find that while these models capture some properties of texture, they fail to capture several compositional properties, such as the colors of dots. We provide critical analysis of existing models by generating synthetic but realistic textures with different descriptions. Our dataset also allows us to train interpretable models and generate language-based explanations of what discriminative features are learned by deep networks for fine-grained categorization where texture plays a key role. We present visualizations of several fine-grained domains and show that texture attributes learned on our dataset offer improvements over expert-designed attributes on the Caltech-UCSD Birds dataset.
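
To make the language-grounding task concrete, the sketch below shows one common way to ground descriptions to images: encode images and texture phrases into a shared embedding space and train with a triplet-style ranking loss over in-batch negatives. This is an illustrative sketch under assumptions, not the authors' implementation; all module names, feature dimensions, and hyperparameters are hypothetical placeholders.

    # Minimal illustrative sketch (PyTorch) of grounding texture phrases to images
    # via a shared embedding space; not the paper's actual model.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ImageEncoder(nn.Module):
        """Projects pre-extracted CNN features (e.g. a ResNet pooled feature) into the joint space."""
        def __init__(self, feat_dim=2048, embed_dim=256):
            super().__init__()
            self.fc = nn.Linear(feat_dim, embed_dim)

        def forward(self, x):
            return F.normalize(self.fc(x), dim=-1)

    class PhraseEncoder(nn.Module):
        """Averages word embeddings of a texture phrase, then projects into the joint space."""
        def __init__(self, vocab_size=10000, word_dim=300, embed_dim=256):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, word_dim, padding_idx=0)
            self.fc = nn.Linear(word_dim, embed_dim)

        def forward(self, tokens):                          # tokens: (batch, max_len) of word ids
            mask = (tokens != 0).float().unsqueeze(-1)      # ignore padding positions
            mean = (self.emb(tokens) * mask).sum(1) / mask.sum(1).clamp(min=1)
            return F.normalize(self.fc(mean), dim=-1)

    def triplet_grounding_loss(img_emb, txt_emb, margin=0.2):
        """Hinge loss over in-batch negatives, in both image-to-phrase and phrase-to-image directions."""
        sim = img_emb @ txt_emb.t()                         # (B, B) cosine similarities
        pos = sim.diag().unsqueeze(1)                       # similarity of matching image-phrase pairs
        cost_txt = (margin + sim - pos).clamp(min=0)        # image matched against wrong phrases
        cost_img = (margin + sim - pos.t()).clamp(min=0)    # phrase matched against wrong images
        eye = torch.eye(sim.size(0), device=sim.device).bool()
        return cost_txt.masked_fill(eye, 0).mean() + cost_img.masked_fill(eye, 0).mean()

    # Usage: a phrase is scored against an image by the dot product of their embeddings.
    img_enc, txt_enc = ImageEncoder(), PhraseEncoder()
    imgs = torch.randn(8, 2048)                             # dummy image features
    phrases = torch.randint(1, 10000, (8, 6))               # dummy phrase token ids
    loss = triplet_grounding_loss(img_enc(imgs), txt_enc(phrases))
    loss.backward()

In a setup of this kind, the image branch would typically be a texture-oriented CNN encoder and the phrase branch a recurrent or transformer encoder; phrase-to-image retrieval then reduces to nearest-neighbour search in the shared space.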



Acknowledgements

We would like to thank Mohit Iyyer for helpful discussions and feedback. The project is supported in part by NSF grants #1749833 and #1617917. Our experiments were performed in the UMass GPU cluster obtained under the Collaborative Fund managed by the Mass. Technology Collaborative.

Author information

Correspondence to Chenyun Wu.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 18614 KB)


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Wu, C., Timm, M., Maji, S. (2020). Describing Textures Using Natural Language. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol. 12346. Springer, Cham. https://doi.org/10.1007/978-3-030-58452-8_4


  • DOI: https://doi.org/10.1007/978-3-030-58452-8_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58451-1

  • Online ISBN: 978-3-030-58452-8

  • eBook Packages: Computer Science, Computer Science (R0)
