Introduction

Increasing evidence shows that vision, action and language should not be regarded as a set of disembodied processes. Instead, they form a closely integrated and highly dynamic system that is attuned to the constraints of its bodily implementation as well as to the constraints coming from the world with which this body interacts. One consequence of such embodiment of cognition is that seeing an object, even when there is no intention to handle it, activates plans for actions directed toward it (e.g., Tucker & Ellis, 1998, 2001; Fischer & Dahl, 2007). Using object names induces action planning effects similar to those produced by seeing the objects themselves (Tucker & Ellis, 2004; Borghi, Glenberg & Kaschak, 2004). Depending on linguistic context, different object features can be activated for action planning, as indicated by facilitated manual responses or “affordance effects” (e.g., Borghi, 2004; Glenberg & Robertson, 2000; Zwaan, 2004). Similarly, different action intentions direct attention differently to object features for processing (e.g., Bekkering & Neggers, 2002; Fischer & Hoellen, 2004; Symes, Tucker, Ellis, Vainio, & Ottoboni, 2008). Eye movements during visually guided actions shed further light on the close relationship between vision, action and language (Land & Furneaux, 1997; Johansson, Westling, Bäckström, & Flanagan, 2001). For example, when humans interact with objects, their eyes move ahead of their hands to support the on-line control of grasping (e.g., Bekkering & Neggers, 2002).

These behavioral results are supported by brain imaging studies of object affordances in humans (e.g., Grèzes, Tucker, Armony, Ellis, & Passingham, 2003) and single cell recordings in monkeys (e.g., Sakata, Taira, Mine, & Murata, 1992; Fadiga, Fogassi, Gallese, & Rizzolatti, 2000). Together, these behavioral and neuroscientific studies have recently begun to inform computational models of embodied cognition. For example, Tsiotas, Borghi and Parisi (2005) devised an artificial life simulation to give an evolutionary account of some affordance effects, and Caligiore, Borghi, Parisi, and Baldassarre (2010) proposed a computational model to account for several affordance-related effects in grasping, reaching, and language. The neuroscientific constraints implemented in the design of the model allow its authors to investigate the neural mechanisms underlying affordance selection and control. The present special issue brings together recent developments at the intersection between behavioral, neuroscientific, and computational approaches to embodied cognition.

Strong support for the close link between vision, action and language comes from studies which highlight how language processing and comprehension make use of neural systems ordinarily used for perception and action (Lakoff, 1987; Zwaan, 2004; Barsalou, 1999; Glenberg & Robertson, 1999; Gallese, 2008; Glenberg, 2010). For example, when humans process the word “cup” they seem to reenact (and therefore internally simulate) many of the perceptual, motor and affective representations related to a cup (Barsalou, 1999). In a similar way, sentences and abstract words are understood by creating a simulation of the actions underlying them (Glenberg & Kaschak, 2002; see also Borghi & Cimatti, 2009, for a new formulation of the embodiment of abstract words which includes social aspects). Moreover, when hearing a verbal description of a visually available scene, humans tend to look at objects that are about to be mentioned (Tanenhaus, Spivey-Knowlton, Eberhard, & Sedivy, 1995), indicating rapid and predictive comprehension that is tightly linked to action. Several computational models have therefore implemented simple learning mechanisms (such as Hebbian rules) to create associations between patterns of active neurons representing the phonological aspects of words and internal simulations (i.e., representations of object features involved in perception and action; cf. Jeannerod, 2007; Mayor & Plunkett, 2010; Caligiore et al., 2010; Li, Farkas, & MacWhinney, 2004).
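To make the flavor of such an associative mechanism concrete, the following minimal sketch shows how a simple Hebbian rule can bind a “phonological” pattern to a co-active “sensorimotor” pattern, so that later presentation of the word alone reactivates the associated features (a crude internal simulation of the referent). The layer sizes, learning rate and example patterns are our own illustrative assumptions and do not correspond to any of the cited models.

```python
import numpy as np

# Minimal Hebbian association sketch (illustrative assumptions only; the
# layer sizes, learning rate and patterns are not taken from the cited models).
N_PHON, N_SIM = 20, 30         # "phonological" and "sensorimotor" layer sizes
rng = np.random.default_rng(0)
W = np.zeros((N_SIM, N_PHON))  # association weights, initially empty
eta = 0.1                      # learning rate

def hebbian_update(W, phon, simu, eta):
    """Strengthen weights between co-active phonological and sensorimotor units."""
    return W + eta * np.outer(simu, phon)

# A word form (e.g., "cup") paired with the sensorimotor features re-enacted
# while interacting with its referent.
word_cup = (rng.random(N_PHON) > 0.7).astype(float)
cup_features = (rng.random(N_SIM) > 0.7).astype(float)

for _ in range(10):            # repeated co-activation during learning
    W = hebbian_update(W, word_cup, cup_features, eta)

# Presenting the word alone now reactivates the associated feature pattern.
reactivation = W @ word_cup
print(np.corrcoef(reactivation, cup_features)[0, 1])  # close to 1
```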

Grounded cognition theories have found a neurophysiological basis in the recent discovery, in monkeys as well as in humans, of two kinds of visuomotor neurons: canonical and mirror neurons (Rizzolatti & Craighero, 2004; Buccino, Binkofski, & Riggio, 2004). Canonical neurons discharge at the visual presentation of objects that can be grasped with a specific type of prehension; this object-directed action is coded motorically by these neurons even when a grasping movement is not required. Mirror neurons, instead, fire both when the monkey performs a goal-directed action and when it observes another monkey or an experimenter performing the same or a similar action. Recent studies, mainly based on brain imaging techniques, indicate the existence of both canonical and mirror neurons in humans (Buccino et al., 2001; Grèzes et al., 2003; Johnson-Frey et al., 2003; Fadiga et al., 2006). Using fMRI, it has been shown that in humans the observation of both object-directed and mimed actions leads to activation of regions in the premotor cortex, including Broca’s region (Buccino et al., 2001), as well as the parietal cortex (Fogassi, Ferrari, Gesierich, Rozzi, Chersi, & Rizzolatti, 2005). The relationship between canonical and mirror neurons and their roles in different cognitive functions, including language processing, remains to be investigated in more detail (see Thill, Caligiore, Borghi, Ziemke, & Baldassarre, 2012, submitted, for an up-to-date review of these topics).

In an influential paper, Rizzolatti and Arbib (1998) proposed that the matching process embodied by mirror neurons represents the basic mechanism from which language evolved. In the last decade this claim has been strongly supported by a series of experimental studies (for reviews see Pulvermüller, 2005; Willems & Hagoort, 2007). First, in an event-related fMRI study, the silent reading of words referring to face, arm or leg actions activated premotor–motor areas related to the word meanings (Hauk, Johnsrude, & Pulvermüller, 2004). An MEG study showed that reading action verbs activates motor and premotor cortices both rapidly and in a somatotopic fashion (Pulvermüller, 2005), thus suggesting that motor activation is inherent to lexical processing. In a further fMRI study, listening to sentences expressing mouth, hand and foot actions produced activation of effector-congruent sectors of the premotor cortex (Tettamanti et al., 2005). Interestingly, these distinct sectors coincide, albeit only approximately, with those active during the observation of hand, mouth and foot actions (Buccino et al., 2001).

These data support the notion that the mirror neuron system is involved not only in understanding visually presented actions, but also in coding acoustically presented action-related sentences. Several studies have shown that similar mechanisms of motor resonance are active when we understand hand and mouth actions, including speech production. First, grasping movements influence syllable pronunciation when they are executed (Gentilucci, Benuzzi, Gangitano, & Grimaldi, 2001) as well as when they are merely observed (Gentilucci, 2003). Second, both listening to and observing speech movements cause an increase in motor evoked potentials recorded from tongue and lip muscles (Watkins, Strafella, & Paus, 2003; Pulvermüller & Fadiga, 2010). Finally, evidence for a link between gesturing and the speech system also comes from clinical studies: Hanlon, Brown and Gerstman (1990) showed that aphasic patients’ object naming benefits from pointing at the referents with the right hand.

Investigations into the integration of vision, action and language have greatly benefited from the use of computational models. The linguistic abilities of an artificial agent whose behavior is controlled by a computational model are strictly dependent on, and grounded in, its other perceptual and motor skills (MacWhinney, 1998; Cangelosi & Riga, 2006; Cangelosi & Parisi, 2002). Such a grounded and embodied approach to language design is consistent with the theories of language grounding discussed above. In these models there exists an intrinsic link between the communication symbols (words) used by the agent and its own cognitive representations (meanings) of the perceptual and sensorimotor interaction with the external world (referents) (Steels & Vogt, 1997; Steels, 2003; Yoon, Heinke & Humphreys, 2002). Cangelosi, Hourdakis, and Tikhanoff (2006) proposed a neural network model in a robotic set-up as a model of language acquisition. The authors show how a robot can acquire new action concepts via linguistic instruction. Moreover, the associative mechanisms linking words and categorical representations of objects are used to transfer the compositional properties of language to sensorimotor representations. Along the same lines, Chersi, Thill, Ziemke and Borghi (2010) recently used a computational model to show how sentence processing might involve chaining mechanisms similar to those underlying action sequence organization (Fogassi et al., 2005). Many other embodied modeling issues remain to be resolved (Pezzulo, Barsalou, Cangelosi, Fischer, Spivey, & McRae, 2011).
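As an illustration of what such a chaining mechanism might look like, the toy sketch below, which is our own illustrative assumption and not the actual architecture of Chersi et al. (2010), represents a sequence as a chain of pools in which each active pool primes its successor, so that the same machinery can step through a motor sequence or through the words of a simple sentence.

```python
# Toy chain of pools: each element, once active, primes its successor.
# This illustrates the general idea that action sequences and sentences might
# share a chaining organization; it is not the model of Chersi et al. (2010).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Chain:
    elements: List[str]                 # motor acts or words
    threshold: float = 1.0
    activations: List[float] = field(init=False)

    def __post_init__(self):
        self.activations = [0.0] * len(self.elements)

    def run(self, start: int = 0) -> List[str]:
        """Activate the chain at 'start' and propagate activation forward."""
        order = []
        self.activations[start] = self.threshold
        for i in range(start, len(self.elements)):
            if self.activations[i] >= self.threshold:
                order.append(self.elements[i])                 # pool fires
                if i + 1 < len(self.elements):
                    self.activations[i + 1] += self.threshold  # prime successor
        return order

# The same chaining machinery stepping through an action sequence ...
print(Chain(["reach", "grasp", "bring-to-mouth"]).run())
# ... and through the words of a simple sentence.
print(Chain(["the", "monkey", "grasps", "the", "peanut"]).run())
```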

The various studies mentioned above investigate the integration of vision, action, and language through embodiment from rather different perspectives. We have discussed results from behavioral experiments as well as from neuroscientific studies and computational modeling. Unfortunately, despite the converging results obtained across these disciplines, there is currently very little substantive discussion and exchange of views among the experts who use these different perspectives. We believe that this kind of multi-methodological and multidisciplinary discussion is a useful and necessary step to share the advantages of each approach and to achieve a cumulative understanding of the neural mechanisms by which the brain integrates vision, action and language into embodied behavior. Instead of stressing the differences between these approaches, it could be productive to focus on their overlapping traits and on their enormous potential for cross-fertilization. The aim of this special issue is, therefore, to direct the attention of scientists who work with different approaches toward an inter-disciplinary discussion about vision, action and language unified through embodiment.

Content of the special issue

This issue of Psychological Research has a clear focus: understanding how the results from different scientific approaches can be shared to energize the empirical and theoretical discussion on the integration of vision, action and language through embodiment. The contributions to this special issue cover a wide range of methodologies, from psychophysics to computational modeling, from classical behavioral methods to neuropsychological and brain imaging approaches. The majority of contributions go beyond the popular reaction time methodology and use movement-related performance to strengthen the case for an embodiment of concept activation.

Some of the themes explored in the special issue are: What kind of organization (at the neural, functional, and computational levels) is needed to support the integration of vision, action and language? How does the timing of language processing influence the embodiment of language representations? What is the relationship between canonical and mirror neurons in the organization of action and language? How do social factors influence the perception and organization of affordances? Is the motor system involved in the understanding of concrete nouns, as it is for concrete verbs? The papers in this special issue bring psychological, neuroscientific and computational modeling evidence to bear on these questions.

Caligiore, Borghi, Parisi, Ellis, Cangelosi, and Baldassarre (2012) propose an extended version of the TRoPICALS computational model (Caligiore et al., 2010) aimed at better understanding the mechanisms underlying the positive as well as negative compatibility effects observed in behavioral experiments. The model addresses the case of distractor objects which, although irrelevant to the agent’s goals, activate affordances that have to be actively suppressed. The simulations fully replicate the findings reported in the literature. The authors further simulate damage to the model similar to that found in Parkinson’s disease in order to predict the compatibility effects that might be found with these patients in future experiments.

De Vega, Moreno, and Castillo (2012) present two experiments that examine changes in motor compatibility effects during comprehension as a function of the relative timing of the motor response and the processing of action-relevant language. The authors show that at short stimulus onset asynchronies the traditional motor compatibility effect is reversed: participants are faster to respond when the direction of the action in the sentence mismatches the direction of the motor response that needs to be made. The work deals with a timely and important issue, and the data help to reconcile some conflicting reports of facilitatory and interfering motor compatibility effects.

Ellis, Swabey, Bridgeman, May, Tucker, and Hyne (2012) report a behavioral study investigating the interaction of the mirror neuron system and the canonical neuron system when humans observe other agents acting on objects, irrespective of those agents’ goals. They make a case for regarding the two systems as different aspects of a common system for orchestrating the actions of agents.

Gianelli, Scorolli, and Borghi (2012) present an empirical study investigating the effects of social factors on the kinematic features of reaching and grasping movements. They recorded reaching and grasping movements performed in the presence of a second person, who could be either a friend or a non-friend. The authors demonstrate that the social relationship between the performer and the second person affected the kinematics of the task. Moreover, speaking sentences related to the reaching and grasping task had an effect that depended on whether “I” or “you” was used as the pronoun. These results point toward social motor control as a novel field of embodiment research.

Iizuka, Marocco, Ando, and Maeda (2012) present an empirical analysis of how a communication system emerges spontaneously between two interacting individuals in the absence of a specifically predefined communication channel. Participants tried to communicate to each other the identity of viewed objects by sliding their fingers on a signaling device. The emerging communication patterns suggested the gradual emergence of turn taking, the association of behavior with perceptual categories, and the acquisition of novel meanings. These observations shed light on the foundations of our sociality.

Marino, Gough, Gallese, Riggio, and Buccino (2012) offer new empirical evidence of embodied meaning associated with action-related nouns rather than verbs. The work addresses the crucial open question of whether the motor system is involved in the understanding of concrete nouns, as it is for concrete verbs. The results are discussed in terms of motor processes in the left hemisphere associated with action nouns.

Weiner and Grill-Spector (2012) summarize the results of two recently published studies (Weiner & Grill-Spector, 2010, 2011) that investigated the distribution of face and limb selectivity in human visual cortex. They propose a new three-stream model of high-level visual cortex that includes ventral, lateral and dorsal areas where multimodal processing related to vision, action and language might converge. Like the other contributions to this special issue, this programmatic proposal sets a framework for a much-needed dialog between disciplines.

Toward a common framework to study the embodiment of vision, action and language

The accumulation of evidence in favor of embodied cognition, which comes from disciplines as different as psychology, neuroscience and robotics, confirms the importance of the topic of this special issue for the wider scientific community. However, the multi-disciplinary and multi-methodological nature of the available data raises an important question: is it possible to find a common framework to interpret and explain data deriving from such vastly different methods? This is a crucial point, because such a common framework could support cross-fertilization among the different disciplines and, importantly, could help to uncover general principles underlying the embodiment of vision, action and language. However, devising a framework that is understandable by scientists with dramatically different backgrounds, who often use different terminology to refer to similar phenomena, is not trivial (Hommel & Colzato, 2010).

In the last decade several promising attempts, mainly based on computational approaches, have been made (Arbib & Lee, 2007; Garagnani, Wennekers, & Pulvermüller, 2008; O’Reilly, 1998; Rothkopf & Ballard, 2010). Arbib and colleagues designed several models, including the FARS model (Fagg & Arbib, 1998) and various incarnations of the MNS models (MNS: Oztop & Arbib, 2002; MNS2-I: Bonaiuto, Rosta, & Arbib, 2007; MNS2-II: Bonaiuto & Arbib, 2010), that might be conducive to the cross-disciplinary investigation of the topic of this special issue. Two other proposals merit attention in this regard since they have started to formalize procedures for building cross-disciplinary frameworks to investigate psychological and neuroscientific phenomena. These two methods are the brain-based devices (BBDs) approach (Fleischer & Edelman, 2009) and the computational embodied neuroscience (CEN) approach (Caligiore et al., 2010; Mannella, Mirolli, & Baldassarre, 2010; cf. Prescott, Montes-Gonzalez, Gurney, Humphries, & Redgrave, 2006, for a similar but less principled approach).

The BBDs approach and the CEN method are similar in conception. The key features of models based on these two methods are: (a) a simulated brain whose anatomy and physiology are constrained by knowledge about real brains; (b) an embodied system which operates in a real environment; (c) comparison with data from behavioral experiments; (d) adaptive learning of behavior. However, unlike BBDs, the CEN approach is also guided by the further and fundamental meta-constraint of theoretical cumulativity. This idea aims at producing general models that account for an increasing number of experiments, while avoiding ad-hoc models that account only for single, specific experiments. In this way it becomes possible to isolate general principles underlying the class of phenomena studied, thereby producing theoretical cumulativity.

To facilitate the integration of different perspectives and different methods, it will also be crucial to design “system-level models”. This means that the main goal of a model should be to provide an operational hypothesis about the cerebral network of networks which underlies the investigated behavior. The system-level approach postulates that different classes of behaviors are generated by the interplay of different subsets of components of the brain, rather than by specific components in isolation. In this way it will be possible to outline an integrated hypothesis about the system-level architectural and functional brain mechanisms which might underlie the behavior under investigation. For example, a system-level model might take into account both cortical (Rizzolatti & Arbib, 1998) and sub-cortical (Strick, Dum, & Fiez, 2009) mechanisms underlying the embodiment of language, or might facilitate the interpretation of brain imaging data (Friston, 2009). We hope that, in the future, designing theoretical and computational frameworks using multi-disciplinary approaches such as those proposed by the BBDs and CEN methods will help to provide a unified view of the embodiment of vision, action, and language, highlighting all its challenging aspects and fostering further research into this exciting topic.