Abstract
We address the problem of grounding representations of word meaning. Our approach learns higher-level representations from visual and textual input using a stacked autoencoder architecture. Both input modalities are encoded as attribute vectors obtained automatically from images and text. To obtain visual attributes (e.g. has_legs, is_yellow) from images, we train attribute classifiers using our large-scale taxonomy of 600 visual attributes, which covers more than 500 concepts and 700K images. We extract textual attributes (e.g. bird, breed) from text with an existing distributional model. Experimental results on tasks related to word similarity show that our stacked autoencoder model can usefully integrate the attribute-based vectors into bimodal representations that are overall more accurate than representations based on the individual modalities or on different integration mechanisms. (The work presented in this chapter is based on [89].)
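As a rough illustration of the kind of architecture the abstract describes, the following sketch encodes a visual and a textual attribute vector separately, then fuses the two hidden codes in a joint layer. It is not the chapter's implementation: the class name, layer sizes, sigmoid activations, tied decoder weights, the omission of bias terms, and the random (untrained) weights are all assumptions made for the example.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


class BimodalAutoencoderSketch:
    """Forward pass of a two-stream stacked autoencoder (hypothetical, untrained)."""

    def __init__(self, n_visual, n_textual, n_hidden_uni, n_hidden_bi, seed=0):
        rng = np.random.default_rng(seed)

        def init(n_in, n_out):
            # Small random weights; a real model would learn these.
            return rng.normal(0.0, 0.1, size=(n_in, n_out))

        self.W_v = init(n_visual, n_hidden_uni)           # visual encoder
        self.W_t = init(n_textual, n_hidden_uni)          # textual encoder
        self.W_joint = init(2 * n_hidden_uni, n_hidden_bi)  # bimodal layer

    def encode(self, visual, textual):
        """Map two attribute vectors to one bimodal representation."""
        h_v = sigmoid(visual @ self.W_v)
        h_t = sigmoid(textual @ self.W_t)
        return sigmoid(np.concatenate([h_v, h_t], axis=-1) @ self.W_joint)

    def decode(self, code):
        """Reconstruct both inputs from the bimodal code (tied weights assumed)."""
        h = sigmoid(code @ self.W_joint.T)
        h_v, h_t = np.split(h, 2, axis=-1)
        return sigmoid(h_v @ self.W_v.T), sigmoid(h_t @ self.W_t.T)


def cosine(a, b):
    """Cosine similarity, a common measure for word-similarity tasks."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


if __name__ == "__main__":
    model = BimodalAutoencoderSketch(n_visual=6, n_textual=8,
                                     n_hidden_uni=4, n_hidden_bi=3)
    rng = np.random.default_rng(1)
    code_a = model.encode(rng.random(6), rng.random(8))
    code_b = model.encode(rng.random(6), rng.random(8))
    print(code_a.shape, cosine(code_a, code_b))
```

In a trained model, the similarity of two words would be estimated from their bimodal codes, for instance with the cosine function above; with random weights the numbers carry no meaning.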
Notes
- 1.
We use the term word to denote any sequence of non-delimiting symbols.
- 2.
We use the term concept to denote the mental representation of objects belonging to basic-level classes (e.g. dog), and the term category to refer to superordinate-level classes (e.g. animal).
- 3.
By the term attributes we refer to semantic properties or characteristics of concepts (or categories), expressed by words which people would use to describe their meaning.
- 4.
Available at http://homepages.inf.ed.ac.uk/csilbere/resources.html.
- 5.
In the context of semantic representations, attributes are often called features or properties in the literature. For the sake of consistency of the present work, we will adhere to the former term.
- 6.
- 7.
Available at http://www.image-net.org.
- 8.
Available at http://homepages.inf.ed.ac.uk/s1151656/resources.html.
- 9.
The code by [28] is available at http://vision.cs.uiuc.edu/attributes/ (last accessed in May 2015).
- 10.
Threshold values ranged from 0 to 0.9 with 0.1 stepsize.
- 11.
For simplicity, we use the symbol w to denote both the concept and its index. Analogously, the symbol a denotes both the attribute and its index.
- 12.
The software is available at http://clic.cimec.unitn.it/strudel/.
- 13.
In a one-hot vector (a.k.a. 1-of-N coding), exactly one element is one and the others are zero. In our case, the non-zero element corresponds to the object label.
- 14.
See [89] for more experiments.
- 15.
Available at http://homepages.inf.ed.ac.uk/s1151656/resources.html.
- 16.
The corpus is downloadable from http://wacky.sslmit.unibo.it/doku.php?id=corpora.
- 17.
We performed random search over combinations of hyper-parameter values.
- 18.
Available at http://w3.usf.edu/FreeAssociation.
- 19.
We thank Elia Bruni for providing us with their data.
- 20.
- 21.
The vectors are available at https://code.google.com/p/word2vec/.
- 22.
Available at http://homepages.inf.ed.ac.uk/s0897549/data/.
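Notes 10 and 13 can be illustrated with a short sketch. The one-hot encoding follows the definition in note 13. The threshold grid reproduces the values stated in note 10. What the thresholds are applied to is not stated in these notes, so the `binarize` helper, its `>=` comparison, and the idea of thresholding classifier scores are assumptions made purely for illustration.

```python
import numpy as np


def one_hot(index, size):
    """Return a 1-of-N vector: one at `index`, zero everywhere else."""
    vec = np.zeros(size, dtype=int)
    vec[index] = 1
    return vec


# Threshold values from 0 to 0.9 with step size 0.1 (note 10).
# Rounding avoids floating-point artefacts such as 0.30000000000000004.
thresholds = [round(0.1 * i, 1) for i in range(10)]


def binarize(scores, threshold):
    """Hypothetical helper: mark scores at or above the threshold as 1."""
    return (np.asarray(scores) >= threshold).astype(int)


if __name__ == "__main__":
    print(one_hot(2, 5))
    print(thresholds)
    print(binarize([0.2, 0.5, 0.9], 0.5))
```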
References
Agirre, E., Soroa, A.: SemEval-2007 Task 02: Evaluating word sense induction and discrimination systems. In: Proceedings of the Fourth International Workshop on Semantic Evaluations (2007)
Andrews, M., Vigliocco, G., Vinson, D.: Integrating experiential and distributional data to learn semantic representations. Psychol. Rev. 116(3), 463–498 (2009)
Barbu, E.: Combining methods to learn feature-norm-like concept descriptions. In: Proceedings of the ESSLLI Workshop on Distributional Lexical Semantics (2008)
Baroni, M., Murphy, B., Barbu, E., Poesio, M.: Strudel: a corpus-based semantic model based on properties and types. Cogn. Sci. 34(2), 222–254 (2010)
Barsalou, L.: Perceptual symbol systems. Behav. Brain Sci. 22, 577–609 (1999)
Bengio, Y.: Learning deep architectures for AI. Found. Trends Mach. Learn. 2(1), 1–127 (2009)
Bengio, Y., Ducharme, R., Vincent, P., Janvin, C.: A neural probabilistic language model. J. Mach. Learn. Res. (JMLR) 3, 1137–1155 (2003)
Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training of deep networks. In: Conference on Neural Information Processing Systems (NIPS) (2006)
Biemann, C.: Chinese whispers—an efficient graph clustering algorithm and its application to natural language processing problems. In: Proceedings of TextGraphs: The 1st Workshop on Graph Based Methods for Natural Language Processing (2006)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. (JMLR) 3, 993–1022 (2003)
Bornstein, M.H., Cote, L.R., Maital, S., Painter, K., Park, S.-Y., Pascual, L.: Cross-linguistic analysis of vocabulary in young children: Spanish, Dutch, French, Hebrew, Italian, Korean, and American English. Child Dev. 75(4), 1115–1139 (2004)
Bruni, E., Tran, G., Baroni, M.: Distributional semantics from text and images. In: Proceedings of the GEMS 2011 Workshop on Geometrical Models of Natural Language Semantics (2011)
Bruni, E., Boleda, G., Baroni, M., Tran, N.: Distributional semantics in technicolor. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (2012)
Bruni, E., Bordignon, U., Liska, A., Uijlings, J., Sergienya, I.: VSEM: an open library for visual semantics representation. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations (2013)
Bruni, E., Tran, N., Baroni, M.: Multimodal distributional semantics. J. Artif. Intell. Res. (JAIR) 49, 1–47 (2014)
Chen, H., Gallagher, A., Girod, B.: Describing clothing by semantic attributes. In: European Conference on Computer Vision (ECCV) (2012)
Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing textures in the wild. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2014)
Collins, A.M., Loftus, E.F.: A spreading-activation theory of semantic processing. Psychol. Rev. 82(6), 407 (1975)
Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural networks with multitask learning. In: International Conference on Machine Learning (ICML) (2008)
Cree, G.S., McRae, K., McNorgan, C.: An attractor model of lexical conceptual processing: simulating semantic priming. Cogn. Sci. 23(3), 371–414 (1999)
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by latent semantic analysis. J. Am. Soc. Inform. Sci. 41(6), 391–407 (1990)
Deng, J., Dong, W., Socher, R., Li, L., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2009)
Devereux, B., Pilkington, N., Poibeau, T., Korhonen, A.: Towards unrestricted, large-scale acquisition of feature-based conceptual representations from corpus data. Res. Lang. Comput. 7(2–4), 137–170 (2009)
Devereux, B.J., Tyler, L.K., Geertzen, J., Randall, B.: The centre for speech, language and the brain (CSLB) concept property norms. Behav. Res. Methods (2013)
Duan, K., Parikh, D., Crandall, D., Grauman, K.: Discovering localized attributes for fine-grained recognition. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2012)
Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL visual object classes challenge 2008 results (2008)
Fan, R., Chang, K., Hsieh, C., Wang, X., Lin, C.: LIBLINEAR: a library for large linear classification. J. Mach. Learn. Res. (JMLR) 9, 1871–1874 (2008)
Farhadi, A., Endres, I., Hoiem, D., Forsyth, D.: Describing objects by their attributes. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2009)
Fellbaum, C. (ed.): WordNet: an electronic lexical database. The MIT Press (1998)
Feng, F., Li, R., Wang, X.: Constructing hierarchical image-tags bimodal representations for word tags alternative choice. In: Proceedings of the ICML Workshop on Challenges in Representation Learning (2013)
Feng, Y., Lapata, M.: Visual information in semantic representation. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (2010)
Ferrari, V., Zisserman, A.: Learning visual attributes. In: Conference on Neural Information Processing Systems (NIPS) (2007)
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., Ruppin, E.: Placing search in context: the concept revisited. ACM Trans. Inform. Syst. 20(1), 116–131 (2002)
Fountain, T., Lapata, M.: Meaning representation in natural language categorization. In: Proceedings of the 31st Annual Conference of the Cognitive Science Society (2010)
Frermann, L., Lapata, M.: Incremental Bayesian learning of semantic categories. In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (2014)
Glenberg, A.M., Kaschak, M.P.: Grounding language in action. Psychon. Bull. Rev. 9(3), 558–565 (2002)
Goldstone, R.L., Kersten, A., Carvalho, P.F.: Concepts and categorization. In: Healy, A.F., Proctor, R.W. (eds.) Comprehensive Handbook of Psychology, vol. 4: Experimental Psychology, pp. 607–630. Wiley (2012)
Griffiths, T.L., Steyvers, M., Tenenbaum, J.B.: Topics in semantic representation. Psychol. Rev. 114(2), 211–244 (2007)
Grondin, R., Lupker, S., McRae, K.: Shared features dominate semantic richness effects for concrete concepts. J. Mem. Lang. 60(1), 1–19 (2009)
Harris, Z.S.: Distributional structure. Word 10(2–3), 146–162 (1954)
Hill, F., Korhonen, A.: Learning abstract concept embeddings from multi-modal data: since you probably can't see what I mean. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (2014)
Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)
Hsu, A.S., Martin, J.B., Sanborn, A.N., Griffiths, T.L.: Identifying representations of categories of discrete items using Markov Chain Monte Carlo with people. In: Proceedings of the 34th Annual Conference of the Cognitive Science Society (2012)
Huang, E.H., Socher, R., Manning, C.D., Ng, A.Y.: Improving word representations via global context and multiple word prototypes. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers (2012)
Huang, J., Kingsbury, B.: Audio-visual deep learning for noise robust speech recognition. In: Proceedings of the 38th International Conference on Acoustics, Speech, and Signal Processing (2013)
Huiskes, M.J., Lew, M.S.: The MIR Flickr retrieval evaluation. In: Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval (2008)
Johns, B.T., Jones, M.N.: Perceptual inference through global lexical similarity. Topics Cogn. Sci. 4(1), 103–120 (2012)
Jones, M.N., Willits, J.A., Dennis, S.: Models of semantic memory. In: Busemeyer, J., Townsend, J., Wang, Z., Eidels, A. (eds.) The Oxford Handbook of Computational and Mathematical Psychology, pp. 232–254. Oxford University Press (2015)
Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
Kelly, C., Devereux, B., Korhonen, A.: Acquiring human-like feature-based conceptual representations from corpora. In: NAACL HLT Workshop on Computational Neurolinguistics (2010)
Kiela, D., Bottou, L.: Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (2014)
Kim, Y., Lee, H., Provost, E.M.: Deep learning for robust feature generation in audiovisual emotion recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing (2013)
Kiros, R., Salakhutdinov, R., Zemel, R.: Unifying visual-semantic embeddings with multimodal neural language models. In: NIPS Deep Learning and Representation Learning Workshop (2014)
Kumar, N., Belhumeur, P.N., Nayar, S.K.: FaceTracer: a search engine for large collections of images with faces. In: European Conference on Computer Vision (ECCV) (2008)
Kumar, N., Berg, A.C., Belhumeur, P.N., Nayar, S.K.: Describable visual attributes for face verification and image search. IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 33(10), 1962–1977 (2011)
Laffont, P.-Y., Ren, Z., Tao, X., Qian, C., Hays, J.: Transient attributes for high-level understanding and editing of outdoor scenes. ACM Trans. Graph. 33(4), 149:1–149:11 (2014)
Lampert, C.H., Nickisch, H., Harmeling, S.: Learning to detect unseen object classes by between-class attribute transfer. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2009)
Landau, B., Smith, L., Jones, S.: Object perception and object naming in early development. Trends Cogn. Sci. 2(1), 19–24 (1998)
Landauer, T., Dumais, S.T.: A solution to Plato’s problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychol. Rev. 104(2), 211–240 (1997)
Lazaridou, A., Pham, N.T., Baroni, M.: Combining language and vision with a multimodal skip-gram model. In: Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the Association for Computational Linguistics (2015)
Liu, J., Kuipers, B., Savarese, S.: Recognizing human actions by attributes. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2011)
Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision (IJCV) 60(2), 91–110 (2004)
Lund, K., Burgess, C.: Producing high-dimensional semantic spaces from lexical co-occurrence. Behav. Res. Methods Instrum. Comput. 28(2), 203–208 (1996)
Mao, J., Xu, W., Yang, Y., Wang, J., Yuille, A.L.: Explain images with multimodal recurrent neural networks. In: NIPS Deep Learning and Representation Learning Workshop (2014)
McRae, K., Jones, M.: Semantic memory. In: Reisberg, D. (ed.) The Oxford Handbook of Cognitive Psychology. Oxford University Press (2013)
McRae, K., Cree, G.S., Seidenberg, M.S., McNorgan, C.: Semantic feature production norms for a large set of living and nonliving things. Behav. Res. Methods 37(4), 547–559 (2005)
Medin, D.L., Schaffer, M.M.: Context theory of classification learning. Psychol. Rev. 85(3), 207–238 (1978)
Mervis, C.B., Rosch, E.: Categorization of natural objects. Annu. Rev. Psychol. 32(1), 89–115 (1981)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Conference on Neural Information Processing Systems (NIPS) (2013)
Mnih, A., Hinton, G.E.: A scalable hierarchical distributed language model. In: Conference on Neural Information Processing Systems (NIPS) (2009)
Nelson, D.L., McEvoy, C.L., Schreiber, T.A.: The University of South Florida Word Association, Rhyme, and Word Fragment Norms (1998)
Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.: Multimodal deep learning. In: International Conference on Machine Learning (ICML) (2011)
O’Connor, C.M., Cree, G.S., McRae, K.: Conceptual hierarchies in a flat attractor network: dynamics of learning and computations. Cogn. Sci. 33(4), 665–708 (2009)
Osherson, D.N., Stern, J., Wilkie, O., Stob, M., Smith, E.E.: Default probability. Cogn. Sci. 15(2), 251–269 (1991)
Parikh, D., Grauman, K.: Relative attributes. In: International Conference on Computer Vision (ICCV) (2011)
Patterson, G., Xu, C., Su, H., Hays, J.: The SUN attribute database: beyond categories for deeper scene understanding. Int. J. Comput. Vision (IJCV) 108(1–2), 59–81 (2014)
Patwardhan, S., Pedersen, T.: Using WordNet-based context vectors to estimate the semantic relatedness of concepts. In: Proceedings of the EACL 2006 Workshop on Making Sense of Sense: Bringing Computational Linguistics and Psycholinguistics Together (2006)
Perfetti, C.: The limits of co-occurrence: tools and theories in language research. Discourse Processes 25(2&3), 363–377 (1998)
Ranzato, M., Szummer, M.: Semi-supervised learning of compact document representations with deep networks. In: International Conference on Machine Learning (ICML) (2008)
Ranzato, M., Poultney, C., Chopra, S., LeCun, Y.: Efficient learning of sparse representations with an energy-based model. In: Conference on Neural Information Processing Systems (NIPS) (2006)
Rastegari, M., Diba, A., Parikh, D., Farhadi, A.: Multi-attribute queries: to merge or not to merge? In: Conference on Computer Vision and Pattern Recognition (CVPR) (2013)
Rogers, T.T., McClelland, J.L.: Semantic Cognition: A Parallel Distributed Processing Approach. The MIT Press (2004)
Rogers, T.T., Lambon Ralph, M.A., Garrard, P., Bozeat, S., McClelland, J.L., Hodges, J.R., Patterson, K.: Structure and deterioration of semantic memory: a neuropsychological and computational investigation. Psychol. Rev. 111(1), 205–235 (2004)
Roller, S., Schulte im Walde, S.: A Multimodal LDA model integrating textual, cognitive and visual modalities. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (2013)
Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. In: Rumelhart, D.E., McClelland, J.L. (eds.) Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1: Foundations, pp. 318–362. The MIT Press (1986)
Russakovsky, O., Fei-Fei, L.: Attribute learning in large-scale datasets. In: ECCV International Workshop on Parts and Attributes (2010)
Russell, B., Torralba, A., Murphy, K., Freeman, W.: LabelMe: a database and web-based tool for image annotation. Int. J. Comput. Vis. (IJCV) 77, 157–173 (2008)
Salton, G., McGill, M.J.: Introduction to modern information retrieval. McGraw-Hill, Inc. (1986)
Silberer, C.: Learning Visually Grounded Meaning Representations. Ph.D. thesis, Institute for Language, Cognition and Computation, School of Informatics, The University of Edinburgh (2015)
Silberer, C., Lapata, M.: Grounded models of semantic representation. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (2012)
Sloman, S.A., Love, B.C., Ahn, W.-K.: Feature centrality and conceptual coherence. Cogn. Sci. 22(2), 189–228 (1998)
Smith, E.E., Shoben, E.J., Rips, L.J.: Structure and process in semantic memory: a featural model for semantic decisions. Psychol. Rev. 81(3), 214–241 (1974)
Socher, R., Pennington, J., Huang, E.H., Ng, A.Y., Manning, C.D.: Semi-supervised recursive autoencoders for predicting sentiment distributions. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2011)
Socher, R., Karpathy, A., Le, Q.V., Manning, C., Ng, A.: Grounded compositional semantics for finding and describing images with sentences. Trans. Assoc. Comput. Linguist. 2, 207–218 (2014)
Sohn, K., Shang, W., Lee, H.: Improved multimodal deep learning with variation of information. In: Conference on Neural Information Processing Systems (NIPS) (2014)
Srivastava, N., Salakhutdinov, R.: Multimodal learning with deep Boltzmann machines. In: Conference on Neural Information Processing Systems (NIPS) (2012)
Srivastava, N., Salakhutdinov, R.: Multimodal learning with deep Boltzmann machines. J. Mach. Learn. Res. (JMLR) 15, 2949–2980 (2014)
Szumlanski, S., Gomez, F., Sims, V.K.: A new set of norms for semantic relatedness measures. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (2013)
Taylor, K.I., Devereux, B.J., Acres, K., Randall, B., Tyler, L.K.: Contrasting effects of feature-based statistics on the categorisation and basic-level identification of visual objects. Cognition 122(3), 363–374 (2012)
Turney, P.D., Pantel, P.: From frequency to meaning: vector space models of semantics. J. Artif. Intell. Res. 37(1), 141–188 (2010)
Tyler, L.K., Moss, H.E.: Towards a distributed account of conceptual knowledge. Trends Cogn. Sci. 5(6), 244–252 (2001)
Vanpaemel, W., Storms, G., Ons, B.: A varying abstraction model for categorization. In: Proceedings of the 27th Annual Conference of the Cognitive Science Society (2005)
Varma, M., Zisserman, A.: A statistical approach to texture classification from single images. Int. J. Comput. Vis. (IJCV) (Special Issue on Texture Analysis and Synthesis) 62(1–2), 61–81 (2005)
Vigliocco, G., Vinson, D.P., Lewis, W., Garrett, M.F.: Representing the meanings of object and action words: the featural and unitary semantic space hypothesis. Cogn. Psychol. 48(4), 422–488 (2004)
Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.-A.: Extracting and composing robust features with denoising autoencoders. In: International Conference on Machine Learning (ICML) (2008)
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.: Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. (JMLR) 11, 3371–3408 (2010)
Vinson, D.P., Vigliocco, G.: Semantic feature production norms for a large set of objects and events. Behav. Res. Methods 40(1), 183–190 (2008)
von Ahn, L., Dabbish, L.: Labeling images with a computer game. In: Conference on Human Factors in Computing Systems (2004)
Voorspoels, W., Vanpaemel, W., Storms, G.: Exemplars and prototypes in natural language concepts: a typicality-based evaluation. Psychon. Bull. Rev. 15, 630–637 (2008)
Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology (2011)
Westermann, G., Mareschal, D.: From perceptual to language-mediated categorization. Philos. Trans. R Soc. B: Biol. Sci. 369(1634), 20120391 (2014)
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Silberer, C. (2017). Grounding the Meaning of Words with Visual Attributes. In: Feris, R., Lampert, C., Parikh, D. (eds) Visual Attributes. Advances in Computer Vision and Pattern Recognition. Springer, Cham. https://doi.org/10.1007/978-3-319-50077-5_13
DOI: https://doi.org/10.1007/978-3-319-50077-5_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-50075-1
Online ISBN: 978-3-319-50077-5
eBook Packages: Computer Science (R0)