In everyday life, most of our experiences involve multiple sensory modalities, and we must therefore be able to switch between them as circumstances demand. A classic example involves the musicians in an orchestra: they must quickly process visually presented auditory content (i.e., sheet music along with the conductor’s gestures) in order to perform. However, this ability develops only through years of training. Indeed, recent research has shown that people incur a cognitive cost when shifting attention between sensory modalities. Interestingly, this cost arises both when switching between events presented in different modalities (Spence, Nicholls, & Driver, 2001) and when switching between sentences with different modality contents (Pecher, Zeelenberg, & Barsalou, 2003). For example, switching from the sentence “BLENDER is loud” to the sentence “BANANA is yellow” incurs a processing cost much like switching from an auditory tone to a light flash. This phenomenon is known as Modality-Shifting or the Modality-Switch Effect (hereafter MSE).

The MSE with language has been extensively explored with both behavioral (Marques, 2006; Pecher, Zeelenberg, & Barsalou, 2004; Scerrati, Baroni, Borghi, Galatolo, Lugli, & Nicoletti, 2015; van Dantzig, Pecher, Zeelenberg, & Barsalou, 2008; see also Vermeulen, Niedenthal, & Luminet, 2007 for a similar result with emotional concepts) and ERP studies (Collins, Pecher, Zeelenberg, & Coulson, 2011; Hald, Marshall, Janssen, & Garnham, 2011; Hald, Hocking, Vernon, Marshall, & Garnham, 2013). It is debatable whether this purely perceptual phenomenon observed during conceptual processing is merely an epiphenomenal result of spreading activation or evidence that perceptual information is engaged in conceptual processing. On the one hand, it has been argued that the conceptual system is separate from sensory information (the disembodied cognition hypothesis; see Mahon & Caramazza, 2008; Mahon & Hickok, 2016 for discussions). On this account, the MSE would merely reflect the way in which activation spreads throughout the system, and therefore would not reveal anything about semantic processing. On the other hand, it has been assumed that the perceptual and conceptual systems are tightly interwoven and share the same processing mechanisms. Proponents of grounded accounts of knowledge (Barsalou, 2008; for recent reviews, see Borghi & Caruana, 2015; Pecher, 2013) assume that knowledge representation and processing are achieved by reactivating aspects of experience. In particular, conceptual processing would involve constructing a sensorimotor simulation of the objects or events that concepts refer to. Such a simulation would involve the partial reactivation of those brain areas that were also active during our interaction with the concepts’ referents. For example, on processing the concept DOG, brain areas that represent visual, auditory, tactile, olfactory, gustatory, affective, and motor information about dogs would be liable to partial reactivation. Importantly, simulations are sketchy records of experience that can be flexibly adapted to the context and task at hand (Barsalou, 1999; Gallese, 2009).

Recently, Scerrati et al. (2015) obtained evidence that sensorimotor simulations can also be triggered by a perceptual, linguistically described stimulus presented in a sensory modality different from vision (i.e., the auditory modality). Participants were presented with a prime sentence describing a light’s or a sound’s perceptual property (e.g., “The light is flickering”, “The sound is echoing”), then they were required to perform a property-verification task on a target sentence with a vision-related or a hearing-related content (e.g., “Butter is yellowish”, “Leaves rustle”). The sensory modality activated by the content of the prime sentence could be compatible with the target’s content modality (e.g., vision–vision: “The light is flickering” followed by “Butter is yellowish”) or not (e.g., vision–audition: “The light is flickering” followed by “Leaves rustle”). Crucially, the stimuli’s presentation modality was manipulated such that half of the participants were faced with written prime and target sentences while the other half were faced with spoken prime and target sentences. The results showed that participants were faster at judging whether a certain property was true of a given concept when the target’s content modality corresponded to the one pre-activated by the content of the prime sentence with both visual and aural presentation of stimuli.

In the present study, we were interested in examining whether switching between different modes of presentation (i.e., visual, aural) across prime and target sentences conveying a sensory content brings about a modality-switching cost. Specifically, we aimed to understand whether and how the conceptual MSE is modulated by the mode of presentation of stimuli. To our knowledge, no previous study has explored this issue in regard to the MSE. Interestingly, however, several studies have found that sentence processing can be affected by mode of presentation. Kaschak, Zwaan, Aveyard, and Yaxley (2006, Experiment 2) showed that participants were faster in making sensibility judgements on target sentences when the direction of motion implied by a sentence with a hearing-related content (e.g., “The commuter had just arrived on the platform when the subway roared into the station”) and the direction of motion depicted by a concurrent auditory stimulus were the same, provided that both the sentence and the stimulus were aurally presented. In a different yet related study, Vermeulen, Corneille, and Niedenthal (2008) showed that asking people to store three visual or auditory items (i.e., pictures or sounds) in short-term memory for a subsequent memory task worsened performance in an intervening property verification task when the latter concerned sentences involving properties in the same modality as the stored items (interference hypothesis). Vermeulen et al. (2008) suggested that the general attentional load imposed upon participants, together with the high complexity of the dual-task paradigm used in their study, moderated switching costs. On the basis of this previous evidence, we expect the mode of presentation of sentences to be relevant in modulating the MSE. Specifically, given that we neither manipulate attentional load nor use a dual-task paradigm, we expect to observe facilitation when the prime and the target share the same presentation and content modality, as in prior studies where switching costs were found.

Whether and how the conceptual MSE is affected by the mode of presentation of stimuli may hinge on task demands. Connell and Lynott (2014) found that task-specific implicit perceptual attention preactivates modality-specific systems, leading to facilitated representation of semantic information related to those modalities. That is, preactivating the visual system through the presentation of strongly visual words (e.g., “cloudy”) facilitated performance in the lexical decision task, whereas preactivating the auditory system through the presentation of strongly auditory words (e.g., “noisy”) facilitated performance when the task was reading aloud. In the present research, we used two different tasks: the property verification task and the lexical decision task (LDT; McNamara, 1992). We believe that the mode of presentation of stimuli might impact the conceptual MSE differently depending on the depth of processing required by the task. With the property verification task, we predict better performance when the presentation and the content modalities of target sentences are congruent (e.g., “Butter is yellowish” presented visually) compared to when they are incongruent (e.g., “Butter is yellowish” presented aurally), owing to the depth of processing required by the task. With a less conceptually engaging task such as the LDT, we instead expect the mode of presentation to feature more prominently than the content modality of sentences.

Methods

Participants

A total of 128 students from the University of Bologna (79 females; mean age: 21.45 years, SD = 2.37) participated in the experiment in exchange for course credit. Of these, 65 were randomly assigned to the property verification task condition and 63 to the LDT condition. All participants were native Italian speakers, had normal or corrected-to-normal vision and hearing by self-report, and were naïve as to the purpose of the experiment. The experiment was approved by the ethics committee of the Psychology Department of the University of Bologna. Written informed consent was obtained from all individual participants included in the study. Minors did not take part in the study.

Materials

A total of 24 prime sentences and 48 target sentences were used in this experiment. Stimuli were the same as in Scerrati et al. (2015). Half of the prime sentences had a vision-related content (e.g., “the LIGHT is flickering”), whereas the other half had a hearing-related content (e.g., “the SOUND is echoing”). Properties in the visual and auditory prime sentences were taken from the norming study by Lynott and Connell (2009) and from a rating of 50 Italian adjectives (see Appendix A in Scerrati et al., 2015). Each of the 24 prime sentences was repeated four times throughout the experiment: it was aurally presented twice over closed-ear headphones and visually presented twice on the screen.

Target sentences were taken from van Dantzig et al.’s (2008) study, with 24 having a vision-related content (e.g., “a WALNUT is brown”) and 24 a hearing-related content (e.g., “a BEE buzzes”). In these critical pairs, the property was always true of the concept. Each pair was used only once. Two properties were repeated once across the pairs, although paired with different concepts (i.e., “a BEE buzzes”, “a FLY buzzes”; “BROCCOLI is green”, “SPINACH is green”). For an overview of the visual and auditory prime and target sentences, see Appendix B in Scerrati et al. (2015). Prime and target sentences were the same across tasks.

As for the property verification task, an additional set of 48 filler sentences, also taken from van Dantzig et al. (2008), was used. In the filler sentences, the property was always false of the concept with 12 having a false visual property (e.g., “the WATER is opaque”), and 12 a false auditory property (e.g., “the COMB sings”), whereas the remaining 24 filler sentences had a false property that did not belong to any modality (e.g., “the BED is sleepy”). This latter type of filler was used in order to avoid participants basing their answers on a superficial word-association strategy, rather than on deeper conceptual processing (see Solomon & Barsalou, 2004).

As for the LDT, an additional set of 48 filler sentences featuring a non-word was used. In half of the filler sentences the non-word was the concept word, whereas in the other half it was the property word. Non-words were generated by altering two of the consonants, or the double consonant, while keeping the vowels unchanged, so as to preserve the phonotactic rules of Italian.
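As a rough illustration of this rule, the following Python sketch (our reconstruction, not the authors’ procedure) replaces two consonants of a word while leaving its vowels in place; unlike the hand-crafted items, it does not guarantee phonotactically legal Italian strings:

```python
# Toy sketch only: swap two consonants, keep vowels. Real items were
# presumably vetted by hand for Italian phonotactics, which this
# illustrative rule does not enforce.
import random

VOWELS = set("aeiou")
CONSONANTS = "bcdfglmnprstvz"

def make_nonword(word: str, rng: random.Random = random.Random(0)) -> str:
    letters = list(word.lower())
    positions = [i for i, ch in enumerate(letters)
                 if ch.isalpha() and ch not in VOWELS]
    for i in rng.sample(positions, k=min(2, len(positions))):
        letters[i] = rng.choice([c for c in CONSONANTS if c != letters[i]])
    return "".join(letters)

print(make_nonword("candela"))  # e.g., a pseudo-word such as "cangeda"
```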

Each participant was presented with 96 prime sentences followed by 96 target sentences (48 critical and 48 fillers) throughout the experimental session. Prime and target sentences were randomly combined to form four modality conditions: different–different (DD, when both the presentation and the content modalities switch from prime to target sentence), different–same (DS, when the presentation modality switches but the content modality does not), same–different (SD, when the content modality switches but the presentation modality does not), and same–same (SS, when the prime and the target sentences share the same presentation and content modalities). For example, a visually presented prime sentence with a vision-related content (e.g., “the LIGHT is flickering”) could be combined with: (1) an aurally presented target sentence with a hearing-related content (e.g., “a BEE buzzes”, DD); (2) an aurally presented target sentence with a vision-related content (e.g., “a WALNUT is brown”, DS); (3) a visually presented target sentence with a hearing-related content (e.g., “a BEE buzzes”, SD); or (4) a visually presented target sentence with a vision-related content (“a WALNUT is brown”, SS). Each target sentence appeared in all modality conditions, counterbalanced across participants.
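For concreteness, a minimal Python sketch of the condition labels and one plausible rotation scheme follows; the sentence dictionaries and the assignment function are hypothetical illustrations, not the authors’ scripts:

```python
# Label a prime-target pair by whether the presentation modality (first
# letter) and the content modality (second letter) switch (D) or stay (S).
def pair_condition(prime: dict, target: dict) -> str:
    pres = "S" if prime["presentation"] == target["presentation"] else "D"
    cont = "S" if prime["content"] == target["content"] else "D"
    return pres + cont  # "DD", "DS", "SD", or "SS"

prime = {"text": "the LIGHT is flickering", "presentation": "visual", "content": "vision"}
target = {"text": "a BEE buzzes", "presentation": "aural", "content": "hearing"}
print(pair_condition(prime, target))  # -> "DD"

# One plausible counterbalancing: rotate each target through the four
# conditions across participants, so every target appears in every
# condition equally often.
CONDITIONS = ["DD", "DS", "SD", "SS"]

def assign_conditions(targets: list, participant_index: int) -> dict:
    return {t["text"]: CONDITIONS[(i + participant_index) % 4]
            for i, t in enumerate(targets)}
```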

Procedure

The stimuli were presented on a 17-inch (c. 43-cm) monitor connected to a 1.6-GHz computer. The participants sat at a viewing distance of about 60 cm from the monitor in a dimly lit room. They were invited to wear a pair of headband headphones before starting the experiment. Each trial started with the presentation of a fixation cross (0.5 cm × 0.5 cm) for 500 ms. Immediately after the fixation, the prime sentence appeared on the screen or was delivered through the headphones for 2000 ms. Then, the target sentence was displayed on the screen or delivered through the headphones until a response was given or until 4000 ms had elapsed. Visually presented prime and target sentences ranged from 5.9 to 17.3 cm in width (from 9 to 29 characters), which resulted in visual angles between 5.6° and 16.5°. All sentences were presented in black, bold, lowercase, 18-point Courier New, centered on a white background. Participants were instructed to read or listen to the prime and target sentences and then judge, as quickly and as accurately as possible, whether in each target sentence the property was true of the concept (property verification task condition), or whether the target sentence contained a non-word (LDT condition). In both task conditions, half of the participants pressed the “s” key of a QWERTY keyboard when either the property was true of the concept or the target sentence contained a non-word, and the “k” key when either the property was false of the concept or the target sentence did not contain a non-word. The other half of the participants were assigned to the reverse mapping.
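The trial structure can be summarized in the following timing skeleton, a sketch with stand-in input/output (the helper functions are hypothetical placeholders; the original study presumably used dedicated stimulus-presentation software):

```python
# Timing skeleton of a single trial; `present` and `wait_for_response`
# are stand-ins for real screen/audio and keyboard routines.
import time

FIXATION_MS, PRIME_MS, TARGET_DEADLINE_MS = 500, 2000, 4000

def present(event: str, duration_ms: int) -> None:
    # Stand-in for drawing text on the screen or playing it over headphones.
    print(f"{event} ({duration_ms} ms)")
    time.sleep(duration_ms / 1000)

def wait_for_response(deadline_ms: int):
    # Stand-in: a real script would poll the "s"/"k" keys here.
    time.sleep(deadline_ms / 1000)
    return None  # no key press within the deadline counts as an omission

def run_trial(prime: dict, target: dict):
    present("fixation cross, 0.5 x 0.5 cm", FIXATION_MS)
    present(f"prime [{prime['presentation']}]: {prime['text']}", PRIME_MS)
    start = time.monotonic()
    print(f"target [{target['presentation']}]: {target['text']}")
    response = wait_for_response(TARGET_DEADLINE_MS)
    rt_ms = (time.monotonic() - start) * 1000  # RT from target onset
    return response, rt_ms
```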

The order of presentation of the prime–target pairs was completely randomized across participants. Participants underwent a short practice session of 32 stimuli (different from those used in the experimental block) before starting the experiment. The experiment consisted of one block of 96 prime–target pairs and lasted approximately 15 min.

Results

In the property verification task condition, five participants (all females) were excluded from the analysis: four failed to reach an accuracy score of 65 %, while the fifth responded on 35 % of the trials in less than 300 ms, indicating that she may have misunderstood the task and tried to respond to the prime sentence as well. Sixty participants therefore remained for further analysis. Responses to filler sentences were discarded. Omissions (5.93 %), incorrect responses (21.42 %), and response times (RTs) faster or slower than the overall participant mean minus or plus 2 standard deviations (2.19 %) were excluded from the analyses. In the LDT condition, three participants (two females) failed to reach an accuracy score of 65 %. Their data were removed, leaving 60 participants for further analysis. Responses to filler sentences were discarded. Omissions (5.03 %), incorrect responses (7.04 %), and RTs faster or slower than the overall participant mean minus or plus 2 standard deviations (2.60 %) were excluded from the analyses.
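For reference, one plausible reading of this trimming rule (per-participant means over correct trials) can be sketched in a few lines of pandas; the DataFrame layout and column names are assumptions:

```python
# Assumed long-format data: columns 'participant', 'rt' (ms), 'correct'.
import pandas as pd

def trim_rts(df: pd.DataFrame) -> pd.DataFrame:
    df = df[df["correct"]].dropna(subset=["rt"]).copy()  # drop errors/omissions
    stats = df.groupby("participant")["rt"].agg(["mean", "std"])
    df = df.join(stats, on="participant")
    within_2sd = df["rt"].between(df["mean"] - 2 * df["std"],
                                  df["mean"] + 2 * df["std"])
    return df.loc[within_2sd].drop(columns=["mean", "std"])
```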

Mean RTs of the correct responses were submitted to a repeated-measures analysis of variance (ANOVA) with Mode of Presentation (different vs. same), Content Modality (different vs. same), and Target Congruency (incongruent vs. congruent) as within-subject factors, run separately for the two tasks (property verification vs. lexical decision). Data are shown in Table 1.
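This 2 × 2 × 2 within-subject design could be reproduced, for example, with statsmodels’ repeated-measures ANOVA; the analysis software actually used is not reported, and the column names below are assumptions:

```python
# Repeated-measures ANOVA over per-condition mean RTs (statsmodels).
import pandas as pd
from statsmodels.stats.anova import AnovaRM

def rm_anova(cell_means: pd.DataFrame):
    """cell_means: one row per participant and condition, with columns
    'participant', 'mop', 'cm', 'tc' (factor levels) and 'rt' (mean ms)."""
    return AnovaRM(data=cell_means, depvar="rt", subject="participant",
                   within=["mop", "cm", "tc"]).fit()
```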

Table 1 Mean response times (in ms) and percentage of errors with standard deviations in parentheses as a function of Mode of Presentation (MoP different, same), Content Modality (CM different, same) and Target Congruency (TC incongruent, congruent) for each task separately

In the property verification task condition, there was a main effect of Mode of Presentation, F(1,59) = 4.582, MSE = 75789.90, p < .05, ηp² = .072; that is, decision latencies were faster when the Mode of Presentation was the same across prime and target sentences rather than different (M: 2036 ms vs. 2090 ms). The analysis also revealed a main effect of Target Congruency, F(1,59) = 18.633, MSE = 65906.25, p < .001, ηp² = .240; that is, decision latencies were faster when the Mode of Presentation and the Content Modality of the target were congruent rather than incongruent (M: 2013 ms vs. 2114 ms). No other main effect or interaction turned out to be significant, Fs < 2.66, ps > .108.

In the LDT condition, there was a main effect of Mode of Presentation, F(1,59) = 6.544, MSE = 59889.45, p < .05, ηp² = .10; that is, decision latencies were faster when the Mode of Presentation was the same across prime and target sentences rather than different (M: 1942 ms vs. 1999 ms). No other main effect or interaction turned out to be significant, Fs < 1.420, ps > .238.

Mean percentages of incorrect responses were submitted to an ANOVA with the same factors as in the RT analysis. In the property verification task condition, no main effect or interaction turned out to be significant, Fs < 2.247, ps > .139. In the LDT condition, there was a significant interaction between Mode of Presentation and Target Congruency, F(1,59) = 4.484, MSE = 140.56, p < .05, ηp² = .071. Paired-sample t tests showed that the percentage of errors was higher when the Mode of Presentation was the same across prime and target sentences and the target was congruent than when the Mode of Presentation was different and the target incongruent (9.5 % vs. 6.8 %), t(59) = –2.075, p < .05, the same and the target incongruent (9.5 % vs. 5.8 %), t(59) = –2.225, p < .05, and different and the target congruent (9.5 % vs. 5.9 %), t(59) = –2.327, p < .05. No other main effect or interaction turned out to be significant, Fs < 2.683, ps > .107.
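Such pairwise comparisons can be sketched with scipy (illustrative only; the analysis software actually used is not reported):

```python
# Paired t test over per-participant error percentages in two cells.
import numpy as np
from scipy.stats import ttest_rel

def compare_cells(errors_a: np.ndarray, errors_b: np.ndarray):
    """errors_a/errors_b: each participant's error percentage in the two
    cells being compared, paired by participant."""
    return ttest_rel(errors_a, errors_b)  # -> (t statistic, p value)
```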

General discussion

The present research investigated whether and to what extent switching between different modes of presentation (i.e., visual, aural) across prime and target sentences affects the conceptual MSE. Although previous studies investigated how sentence processing can be affected by the mode of presentation of linguistic stimuli, this relationship had not previously been studied in the context of the MSE. Given that the impact of the mode of presentation of stimuli on language processing may be modulated by task demands (see Connell & Lynott, 2014 for a similar result in a different context), we compared performance on a property verification priming paradigm with performance on a lexical decision priming paradigm, two tasks that involve different depths of conceptual processing.

In keeping with our hypothesis, we found evidence for the involvement of the mode of presentation of stimuli in both the property verification and the lexical decision task. Crucially, results from both tasks showed that the presentation-driven effect weakens the conceptual MSE. Indeed, a conceptual MSE emerged numerically in the property verification task, but not in the LDT, as expected; however, it did not reach significance.

Interestingly, the property verification task highlighted an effect of target congruency. That is, participants were slower in deciding whether a certain property was true of the concept when the presentation and the content modality of the target were incongruent (e.g., “a BEE buzzes” presented visually) compared to when they were congruent. Such a within-target MSE is in line with the results of van Dantzig et al. (2008), who showed that, when a perceptual stimulus (i.e., a light flash, a tone, or a vibration) and a subsequent target sentence were in different sensory modalities, decision latencies were slower compared to when they were in the same modality. Our results broaden their finding by showing such an effect within a single stimulus, that is, when the processing of perceptual and conceptual information overlaps in time. It is worth noting that this interference only occurred with the property verification task. Therefore, it seems likely that, since the lexical decision task did not emphasize conceptual processing, it recruited the semantic system only to an extent insufficient to generate interference between the two systems.

In sum, our findings show that conceptual processing is affected not only by switching between sensory modalities on a semantic level (i.e., the content modality of stimuli) but also by switching between sensory modalities on a purely perceptual level (i.e., the mode of presentation of stimuli). Interestingly, our results also demonstrate a task-dependent, complex interplay of perceptual and semantic information taking place within the target. These findings challenge the view according to which the MSE does not reveal anything about semantic processing, as claimed by critics of the grounded accounts of knowledge (Mahon & Caramazza, 2008).

We conclude that the MSE is a task-related, multilevel effect that can occur at two different levels of information processing, i.e., perceptual and semantic. We interpret these results as further evidence supporting the view according to which the perceptual and conceptual systems are tightly interwoven and share the same processing mechanisms, as claimed by the simulation account of conceptual processing (Barsalou, 1999, 2003, 2008).