Describing Textures Using Natural Language

Wu, Chenyun; Timm, Mikayla; Maji, Subhransu

doi:10.1007/978-3-030-58452-8_4

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 12346))

Included in the following conference series:

European Conference on Computer Vision

13k Accesses
4 Citations

Abstract

Textures in natural images can be characterized by color, shape, periodicity of elements within them, and other attributes that can be described using natural language. In this paper, we study the problem of describing visual attributes of texture on a novel dataset containing rich descriptions of textures, and conduct a systematic study of current generative and discriminative models for grounding language to images on this dataset. We find that while these models capture some properties of texture, they fail to capture several compositional properties, such as the colors of dots. We provide critical analysis of existing models by generating synthetic but realistic textures with different descriptions. Our dataset also allows us to train interpretable models and generate language-based explanations of what discriminative features are learned by deep networks for fine-grained categorization where texture plays a key role. We present visualizations of several fine-grained domains and show that texture attributes learned on our dataset offer improvements over expert-designed attributes on the Caltech-UCSD Birds dataset.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

FastText pretrained embeddings. https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
FGVC Butterflies and Moths Dataset. https://sites.google.com/view/fgvc6/competitions/butterflies-moths-2019
Pretrained BERT of version “bert-base-uncased”. https://huggingface.co/transformers/pretrained_models.html
Pretrained ELMo. https://allennlp.s3.amazonaws.com/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5
Amadasun, M., King, R.: Textural features corresponding to textural properties. IEEE Trans. Syst. Man Cybern. 19(5), 1264–1274 (1989)
Article Google Scholar
Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6077–6086 (2018)
Google Scholar
Antol, S., et al.: VQA: visual question answering. In: IEEE International Conference on Computer Vision (ICCV) (2015)
Google Scholar
Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: NetVLAD: CNN architecture for weakly supervised place recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5297–5307 (2016)
Google Scholar
Bajcsy, R.: Computer description of textured surfaces. In: Proceedings of the 3rd International Joint Conference on Artificial Intelligence, pp. 572–579 (1973)
Google Scholar
Banerjee, S., Lavie, A.: METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72. Association for Computational Linguistics, Ann Arbor (June 2005). https://www.aclweb.org/anthology/W05-0909
Bell, S., Upchurch, P., Snavely, N., Bala, K.: Material recognition in the wild with the materials in context database. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3479–3487 (2015)
Google Scholar
Bhushan, N., Rao, A.R., Lohse, G.L.: The texture lexicon: understanding the categorization of visual texture terms and their relationship to texture images. Cogn. Sci. 21(2), 219–246 (1997)
Article Google Scholar
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. (TACL) 5, 135–146 (2017)
Article Google Scholar
Brendel, W., Bethge, M.: Approximating CNNs with Bag-of-Local-Features Models works surprisingly well on ImageNet. In: International Conference on Learning Representations (ICLR) (2019)
Google Scholar
Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing textures in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014)
Google Scholar
Cimpoi, M., Maji, S., Vedaldi, A.: Deep filter banks for texture recognition and segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
Google Scholar
Csurka, G., Dance, C., Fan, L., Willamowski, J., Bray, C.: Visual categorization with bags of keypoints. In: Workshop on Statistical Learning in Computer Vision. ECCV, Prague, vol. 1, pp. 1–2 (2004)
Google Scholar
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2009)
Google Scholar
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Gao, P., et al.: Dynamic fusion with intra-and inter-modality attention flow for visual question answering. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6639–6648 (2019)
Google Scholar
Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2414–2423 (2016)
Google Scholar
Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F.A., Brendel, W.: ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In: International Conference on Learning Representations (ICLR) (2018)
Google Scholar
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016)
Google Scholar
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Article Google Scholar
Hoffer, E., Ailon, N.: Deep metric learning using triplet network. In: Feragen, A., Pelillo, M., Loog, M. (eds.) SIMBAD 2015. LNCS, vol. 9370, pp. 84–92. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24261-3_7
Chapter Google Scholar
Hosseini, H., Xiao, B., Jaiswal, M., Poovendran, R.: Assessing shape bias property of convolutional neural networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2018)
Google Scholar
Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: ReferItGame: referring to objects in photographs of natural scenes. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 787–798 (2014)
Google Scholar
Kim, J.H., Jun, J., Zhang, B.T.: Bilinear attention networks. In: Advances in Neural Information Processing Systems, pp. 1564–1574 (2018)
Google Scholar
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Leung, T., Malik, J.: Representing and recognizing the visual appearance of materials using three-dimensional textons. Int. J. Comput. Vis. (IJCV) 43(1), 29–44 (2001)
Article Google Scholar
Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Proceedings of the ACL Workshop: Text Summarization Braches Out 2004, p. 10 (January 2004)
Google Scholar
Lin, T.Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Chapter Google Scholar
Lin, T.Y., Maji, S.: Visualizing and understanding deep texture representations. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2791–2799 (2016)
Google Scholar
Lin, T.Y., RoyChowdhury, A., Maji, S.: Bilinear CNN models for fine-grained visual recognition. In: IEEE International Conference on Computer Vision (ICCV) (2015)
Google Scholar
Lin, T.Y., RoyChowdhury, A., Maji, S.: Bilinear convolutional neural networks for fine-grained visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 40(6), 1309–1322 (2018)
Article Google Scholar
Liu, J., Wang, L., Yang, M.H.: Referring expression generation and comprehension via attributes. In: IEEE International Conference on Computer Vision (ICCV) (2017)
Google Scholar
Liu, X., Wang, Z., Shao, J., Wang, X., Li, H.: Improving referring expression grounding with cross-modal attention-guided erasing. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1950–1959 (2019)
Google Scholar
Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
Google Scholar
Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP) (December 2008)
Google Scholar
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311–318. Association for Computational Linguistics (2002)
Google Scholar
Perronnin, F., Liu, Y., Sánchez, J., Poirier, H.: Large-scale image retrieval with compressed fisher vectors. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3384–3391. IEEE (2010)
Google Scholar
Peters, M., et al.: Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237 (June 2018)
Google Scholar
Plummer, B.A., Kordas, P., Kiapour, M.H., Zheng, S., Piramuthu, R., Lazebnik, S.: Conditional image-text embedding networks. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11216, pp. 258–274. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01258-8_16
Chapter Google Scholar
Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In: IEEE International Conference on Computer Vision (ICCV), pp. 2641–2649 (2015)
Google Scholar
Portilla, J., Simoncelli, E.P.: A parametric texture model based on joint statistics of complex wavelet coefficients. Int. J. Comput. Vis. (IJCV) 40(1), 49–70 (2000)
Article Google Scholar
Sánchez, J., Perronnin, F., Mensink, T., Verbeek, J.: Image classification with the Fisher vector: theory and practice. Int. J. Comput. Vis. (IJCV) 105(3), 222–245 (2013)
Article MathSciNet Google Scholar
Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans. Sig. Process. 45(11), 2673–2681 (1997)
Article Google Scholar
Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2556–2565 (2018)
Google Scholar
Tamura, H., Mori, S., Yamawaki, T.: Textural features corresponding to visual perception. IEEE Trans. Syst. Man Cybern. 8(6), 460–473 (1978)
Article Google Scholar
Tan, H., Bansal, M.: LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490 (2019)
Vedantam, R., Lawrence Zitnick, C., Parikh, D.: CIDEr: consensus-based image description evaluation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4566–4575 (2015)
Google Scholar
Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3156–3164 (2015)
Google Scholar
Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD Birds-200-2011 Dataset. Technical report, CNS-TR-2011-001, California Institute of Technology (2011)
Google Scholar
Wang, P., Wu, Q., Cao, J., Shen, C., Gao, L., Hengel, A.: Neighbourhood watch: referring expression comprehension via language-guided graph attention networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1960–1968 (2019)
Google Scholar
Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning (ICML), pp. 2048–2057 (2015)
Google Scholar
Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. (TACL) 2, 67–78 (2014)
Article Google Scholar
Yu, L., et al.: MAttNet: modular attention network for referring expression comprehension. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
Google Scholar
Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_5
Chapter Google Scholar
Yu, Z., Yu, J., Cui, Y., Tao, D., Tian, Q.: Deep modular co-attention networks for visual question answering. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6281–6290 (2019)
Google Scholar

Download references

Acknowledgements

We would like to thank Mohit Iyyer for helpful discussions and feedback. The project is supported in part by NSF grants #1749833 and #1617917. Our experiments were performed in the UMass GPU cluster obtained under the Collaborative Fund managed by the Mass. Technology Collaborative.

Author information

Authors and Affiliations

University of Massachusetts Amherst, Amherst, USA
Chenyun Wu, Mikayla Timm & Subhransu Maji

Authors

Chenyun Wu
View author publications
You can also search for this author in PubMed Google Scholar
Mikayla Timm
View author publications
You can also search for this author in PubMed Google Scholar
Subhransu Maji
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Chenyun Wu .

Editor information

Editors and Affiliations

University of Oxford, Oxford, UK
Andrea Vedaldi
Graz University of Technology, Graz, Austria
Horst Bischof
University of Freiburg, Freiburg im Breisgau, Germany
Thomas Brox
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Jan-Michael Frahm

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 18614 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Wu, C., Timm, M., Maji, S. (2020). Describing Textures Using Natural Language. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12346. Springer, Cham. https://doi.org/10.1007/978-3-030-58452-8_4

Download citation

DOI: https://doi.org/10.1007/978-3-030-58452-8_4
Published: 03 November 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58451-1
Online ISBN: 978-3-030-58452-8
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics