Experimental Brain Research

Volume 193, Issue 4, pp 603–614

The dog’s meow: asymmetrical interaction in cross-modal object recognition

Authors

  • Shlomit Yuval-Greenberg
    • Department of Psychology, The Hebrew University of Jerusalem
  • Leon Y. Deouell
    • Department of Psychology, The Hebrew University of Jerusalem
    • Interdisciplinary Center for Neural Computation, The Hebrew University of Jerusalem
Research Article

DOI: 10.1007/s00221-008-1664-6

Cite this article as:
Yuval-Greenberg, S. & Deouell, L.Y. Exp Brain Res (2009) 193: 603. doi:10.1007/s00221-008-1664-6

Abstract

Little is known about cross-modal interaction in complex object recognition. The factors influencing this interaction were investigated using simultaneous presentation of pictures and vocalizations of animals. In separate blocks, the task was to identify either the visual or the auditory stimulus, ignoring the other modality. The pictures and the sounds were congruent (same animal), incongruent (different animals) or neutral (an animal paired with a meaningless stimulus). Performance in congruent trials was better than in incongruent trials, regardless of whether subjects attended the visual or the auditory stimuli, but the effect was larger in the latter case. This asymmetry persisted when a long delay was added between the stimulus and the response. Thus, the asymmetry cannot be explained by a lack of processing time for the auditory stimulus. However, the asymmetry was eliminated when low-contrast visual stimuli were used. These findings suggest that when visual stimulation is highly informative, it affects auditory recognition more than auditory stimulation affects visual recognition. Nevertheless, this modality dominance is not rigid; it is highly influenced by the quality of the presented information.

Keywords

Auditory · Visual · Human · Multisensory · Object recognition · Conflict

Introduction

In natural circumstances, information is frequently obtained simultaneously from several sensory modalities. Normally, information emanating from one object is compatible across different modalities; an approaching car does not sound like a train engine, and a dog opening its mouth will bark, not chirp. Plausibly, this redundancy of information facilitates recognition. Nevertheless, when conditions are not optimal (for example when the environment is noisy, input arrives very fast, or from multiple sources), perceptual errors might give rise to sensory incongruity and thus create cross-modal conflicts. Investigating the interferences caused by conflicts and their resolution is an important step towards better understanding of cross-modal integration mechanisms.

Research on cross-modal conflicts has focused mostly on conflicts regarding the phonetic content of speech (McGurk Illusion; McGurk and MacDonald 1976), on the spatial location of events (Ventriloquism Effect; Bermant and Welch 1976; Bertelson 1999; Bertelson and Radeau 1981), and recently also on conflicts regarding the timing of events (Bertelson and Aschersleben 2003; Morein-Zamir et al. 2003; Recanzone 2003). This research attempted to determine the relative influence of each modality in the resolution of the conflict. The evidence suggests that the visual modality is dominant in the spatial domain while the auditory modality is more influential in the time domain. However, this asymmetry may not be as intrinsic as previously thought. Recent evidence suggests that it depends on the amount and reliability of information provided by the different senses in different circumstances. For example, blurring the visual input reduces the degree to which visual stimuli dominate spatial localization of multisensory events (Alais and Burr 2004; Hairston et al. 2003a). Here, we address cross-modal conflicts and integration in the case of object recognition.

A common way to study the influence of one stream of information on another is by using conflict situations, as in the Stroop paradigm (Stroop 1935). In cross-modal versions of the Stroop paradigm, subjects were required to name a patch of color, which was presented simultaneously with a spoken word. Similar to the classic Stroop effect, color naming was slowed by incongruent spoken color names, but word naming was not affected by the colors of the patches (e.g. Elliott et al. 1998; Roelofs 2005; Shimada 1990). Other studies investigated Stroop-like conflicts between more complex visual stimuli and spoken names, including digits and digit names (e.g. Driver and Baylis 1993; Mynatt 1977), letters and letter names (Larsen et al. 2003; Tellinghuisen and Nowak 2003), written and spoken words (Lewis 1972; Sen and Posner 1979) or pictures and names of common objects (Stuart and Carrasco 1993). All studies found that an unattended spoken word, presented simultaneously with a visual object or word, influenced the responses to the visual stimulus. A few also found that spoken word recognition was affected by the simultaneous presentation of a written word. This is taken as evidence for an automatic interaction between unattended and attended modalities. However, since these studies investigated conflicts between a visual object or word and a spoken word, their findings cannot be attributed purely to conflict between auditory and visual object recognition processes; this is better addressed by presenting visual objects and recognizable non-verbal sounds. A few recent studies investigated such conflicts (Beauchamp et al. 2004; Laurienti et al. 2003; Lehmann and Murray 2005; Molholm et al. 2004), but none of them included both attend-visual (ignore-auditory) and attend-auditory (ignore-visual) conditions. Thus, these studies could not examine cross-modal asymmetries in the effect of the unattended modality on the processing of the attended modality.

In the present study, we investigated cross-modal audio–visual object recognition using Stroop-like conflicts between objects’ identities. Subjects were presented simultaneously with pictures and vocalizations of animals which were either congruent (of the same animal) or incongruent (of different animals). The stimuli were followed by a verification word (e.g. “dog?”, “cat?”) and the subjects were instructed to answer “yes” or “no” according to the identification of either the visual or the auditory stimulus. In a previous study which addressed the EEG correlates of cross-modal congruity, we used a similar behavioral paradigm and found that cross-modal congruity effects existed whether the visual or the auditory stimulus was attended, although unattended visual information affected auditory recognition more than vice versa (Yuval-Greenberg and Deouell 2007). In the current study, the main research goal was to establish this asymmetry, and especially to determine the effect of the quality of the information presented by each modality on modality dominance. Thus, the visual information was presented using either high- or low-contrast stimuli. Two experiments were run. In the first experiment, the verification item was visual (a written word) and in the second experiment it was auditory (a spoken word). Our results confirm that under optimal conditions, irrelevant visual information affects recognition of sounds more than vice versa. However, the visual superiority may be eliminated under adverse conditions for visual recognition.

Experiment 1

Materials and methods

Subjects

Fifty-five native Hebrew-speaking students of the Hebrew University of Jerusalem participated in the experiment for payment or course credit. Data of eight subjects were rejected from the analysis because of low performance (less than 70% accuracy) in one of the conditions. The remaining 47 subjects were 18–37 years old (mean 25.3, SD 3.6); 28 were female, 43 were right-handed and 1 was ambidextrous. All subjects reported normal hearing and normal or corrected-to-normal vision, with no history of neurological problems. The study was conducted in accordance with the ethical regulations of the Hebrew University of Jerusalem. All subjects signed an informed consent form.

Stimuli

The auditory stimuli included 48 vocalizations of 6 familiar animals (cat, dog, rooster, bird, sheep and cow; 8 different vocalizations for each animal) and 48 unidentifiable mechanical sounds which were used as neutrals. The sounds (sampling rate: 22,050 Hz; resolution: 16 bits per sample) were trimmed to last 500 ms. They were presented from a loudspeaker placed immediately under the monitor displaying the visual stimuli. In a pilot study, the auditory stimuli were presented as in the main experiment, and subjects (N = 7) were required to name the animal aloud or to indicate that the sound was not an animal vocalization. Only stimuli which were accurately named by all subjects were included in the main experiment. Visual stimuli consisted of 48 color photographs of the same 6 animals (8 different pictures of each animal), half of which were high contrast (High Contrast condition) and the other half low contrast (Low Contrast condition; see Fig. 1). The neutral visual stimuli were 48 unidentifiable pictures, each created by arranging 25 randomly selected and slightly blurred sections of the original pictures in a 5 × 5 pattern. Half of the neutral stimuli were high contrast and half were low contrast (Fig. 1). The contrast of the low-contrast pictures was lowered by 70–80% (depending on the original quality of the picture) relative to their original contrast. The final average Michelson contrast, \( \text{MC} = (I_{\max} - I_{\min})/(I_{\max} + I_{\min}) \), where \(I_{\max}\) and \(I_{\min}\) are the highest and lowest luminance in each picture (Michelson 1927), was 0.97 for the high-contrast pictures (range 0.8–1) and 0.13 for the low-contrast pictures (range 0.07–0.22). The stimuli subtended 4.3° (w) × 3.1° (h) at the center of a ViewSonic G75f CRT monitor (Walnut, California) with a resolution of 768 × 1,024 pixels and a refresh rate of 100 Hz, viewed from 100 cm.
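For illustration, the following Python sketch (our own, not the authors’ code) computes the Michelson contrast of a grayscale image stored as a NumPy array and lowers its contrast by scaling pixel values toward the mean luminance; the pivot point and scaling rule are assumptions about how such a reduction might be implemented.

import numpy as np

def michelson_contrast(img):
    """Michelson contrast MC = (Imax - Imin) / (Imax + Imin) of a grayscale image."""
    i_max, i_min = float(img.max()), float(img.min())
    return (i_max - i_min) / (i_max + i_min)

def reduce_contrast(img, fraction=0.75):
    """Scale pixel values toward the mean luminance, lowering contrast by roughly
    the given fraction (the pivot and scaling rule are illustrative assumptions)."""
    mean = img.mean()
    scaled = mean + (img.astype(float) - mean) * (1.0 - fraction)
    return np.clip(scaled, 0, 255).astype(np.uint8)

# Toy example with a synthetic image standing in for a stimulus photograph
img = (np.random.rand(240, 320) * 255).astype(np.uint8)
print(michelson_contrast(img))                   # near 1 for a full-range image
print(michelson_contrast(reduce_contrast(img)))  # substantially lower after reduction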
Fig. 1

Trial procedure. The task was predetermined by a question presented at the beginning of each block: “Which animal did you see?” or “Which animal did you hear?”. The fixation cross was presented for 1 s, followed immediately by one of the stimulus combinations for 500 ms. This was followed, either immediately (in the short-delay condition of “Experiment 1” and in Experiment 2) or after 1 s (in the long-delay condition of Experiment 1), by a verification item presented as a forced-choice yes/no question (e.g. “dog?”) until a response was given or until 3 s elapsed. The verification item was a written (Experiment 1) or a spoken (Experiment 2) word. Box: samples of visual stimuli. The visual stimuli included high- and low-contrast pictures of animals and neutral pictures

Procedure

Subjects were seated in a sound-attenuated chamber. The experiment included 2 main blocks (attend-visual and attend-auditory) of 288 trials each. A short break was given following every 48 trials. The order of the blocks was counterbalanced across subjects. In addition, a short training block, including 15 trials and feedback on performance, was presented before each main block. The pictures and vocalizations used in the training block were not presented in the main blocks.

The trials in all blocks were composed of the concurrent presentation of a visual and an auditory stimulus. In both blocks, one-third of the trials were congruent (picture and vocalization of the same animal), one-third were incongruent (picture and vocalization belonging to different animals) and the rest were neutral (a picture of an animal with a neutral sound in the attend-visual block, or a vocalization of an animal with a neutral picture in the attend-auditory block). Half of the trials in each condition presented high-contrast pictures, whereas the other half presented low-contrast pictures. The order of the trials was random. Each picture and each vocalization was presented twice in each condition (congruent, incongruent and neutral). Since the number of possible incongruent combinations is very large, a different list of incongruent trials was randomly selected for each subject.
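A minimal Python sketch of how such a trial list could be assembled follows; the animal names match those used in the experiment, but the data structures, trial counts per cell and pairing logic are illustrative assumptions rather than the authors’ actual procedure (the contrast manipulation is omitted for brevity).

import random

ANIMALS = ["cat", "dog", "rooster", "bird", "sheep", "cow"]

def make_trials(n_per_condition=8, seed=None):
    """Build congruent, incongruent and neutral audio-visual pairings for one block.
    Incongruent pairings are re-drawn for each subject, as described in the text."""
    rng = random.Random(seed)
    trials = []
    for animal in ANIMALS:
        for _ in range(n_per_condition):
            # congruent: picture and vocalization of the same animal
            trials.append({"picture": animal, "sound": animal, "condition": "congruent"})
            # incongruent: vocalization of a randomly chosen different animal
            other = rng.choice([a for a in ANIMALS if a != animal])
            trials.append({"picture": animal, "sound": other, "condition": "incongruent"})
            # neutral: animal picture paired with an unidentifiable sound (attend-visual block)
            trials.append({"picture": animal, "sound": "neutral", "condition": "neutral"})
    rng.shuffle(trials)
    return trials

block = make_trials(seed=1)
print(len(block), block[0])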

Subjects were instructed at the beginning of each block to recognize and respond based on the auditory stimuli alone (attend-auditory blocks) or on the visual stimuli alone (attend-visual blocks). Trials began with a fixation cross presented for 1 s (Fig. 1). The audio–visual stimulus was then presented, lasting 500 ms. For one group of subjects (short-delay group; N = 21), the stimulus was followed immediately by a verification item (e.g. “dog?”) in Hebrew. For the second group of subjects (long-delay group; N = 26), the presentation of the verification item was delayed by an intervening black screen lasting 1 s, resulting in an SOA of 1,500 ms from the audio–visual stimulus onset to the verification item onset. This long delay ensured full processing of the auditory stimulus (which takes time to unfold) by the time the verification item appeared. In both cases, the verification item remained on the screen until a response was given or until 3 s elapsed. Subjects were instructed to respond “yes” or “no” by pressing one of two buttons. For congruent and neutral trials, the verification item was either the name of the animal presented by the stimulus (for “yes” trials) or a random name of one of the other animals (for “no” trials). For incongruent trials, the verification item included the name of either the attended animal or the unattended animal, and never a third animal. This was necessary to avoid above-chance performance when subjects erroneously attended the wrong modality: had an incongruent pair been followed by a third animal’s name, the correct answer would have been “no”, and the subject could provide this answer even after attending the wrong modality, artificially elevating the performance level.

Subjects responded by pressing one of two buttons to indicate their response. The correct response was affirmative in half of the trials and negative in the other half, independent of whether the trial was congruent, incongruent, or neutral. Deferring the presentation of the verification item until after the audio–visual stimulus had ended compensated for the intrinsic difference in the temporal evolution of visual and auditory stimuli, as a response could not be selected before the auditory stimulus was fully presented, even in the short-delay condition. It also disentangled the brain responses related to perceptual conflicts from those related to response conflicts in related studies using measures of neural activity (Yuval-Greenberg and Deouell 2007).

Analysis

Only reaction times of correct responses were analyzed. For each subject, trials with reaction times more than two standard deviations from the mean of each condition were discarded (the remaining trials included 93.9–97.3% of correct-response trials per subject, 95.3% on average, as expected from an approximately normal distribution of reaction times). Differences between conditions were evaluated using analysis of variance (ANOVA). Greenhouse–Geisser correction was used to correct the degrees of freedom where indicated. The corrected P values are presented along with the uncorrected degrees of freedom and the Greenhouse–Geisser epsilon (ε).
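A minimal sketch of this trimming step, assuming single-trial data in a pandas DataFrame with columns subject, condition, correct and rt (column names are our assumptions):

import pandas as pd

def trim_rts(df, n_sd=2.0):
    """Keep correct trials whose RT lies within n_sd standard deviations
    of the subject-by-condition mean, as described in the Analysis section."""
    correct = df[df["correct"]].copy()
    grouped = correct.groupby(["subject", "condition"])["rt"]
    mean = grouped.transform("mean")
    sd = grouped.transform("std")
    return correct[(correct["rt"] - mean).abs() <= n_sd * sd]

# Example usage with a toy data frame
data = pd.DataFrame({
    "subject": [1] * 6,
    "condition": ["congruent"] * 3 + ["incongruent"] * 3,
    "correct": [True, True, False, True, True, True],
    "rt": [650, 700, 950, 720, 760, 1400],
})
print(trim_rts(data))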

Results

The effect of contrast on visual recognition

We first established that the contrast manipulation indeed impeded recognition of the visual stimuli, by examining the speed of performance under different contrasts in the attend-visual neutral condition alone, i.e. without specific interference or facilitation by auditory information. Reaction times were longer and accuracy was lower for the low-contrast condition compared to the high-contrast condition, as confirmed by the main effect of Contrast in a two-way mixed ANOVA with factors Contrast and Delay (RT: F(1, 45) = 6.06, P < 0.05; accuracy: F(1, 45) = 43.83, P < 10−7). The interaction between contrast and delay was significant for RTs (F(1, 45) = 14.44, P < 0.001), but not for accuracy (F(1, 45) = 1.23, P = 0.27). The interaction in the RT data was due to the fact that subjects were slower to respond in the case of low-contrast than high-contrast images (t(20) = 4.2, P < 0.001; Table 1) with the short delay, but not in the long-delay condition (t(25) = 1, P = 0.3). However, in both the short- and long-delay conditions, accuracy was lower with low contrast than with high contrast (short delay: t(20) = 3.67, P < 0.01; long delay: t(25) = 5.83, P < 10−5; Table 1). Taken together, these findings confirm that reducing the contrast impeded recognition of the visual stimuli.
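For illustration, follow-up comparisons of this kind can be run as paired t-tests across subjects; the sketch below uses toy per-subject mean RTs (not the actual data) with SciPy.

import numpy as np
from scipy import stats

# Per-subject mean RTs in the neutral attend-visual condition (toy numbers,
# not the actual data), one value per subject for each contrast level.
rt_high_contrast = np.array([720, 745, 760, 710, 735])
rt_low_contrast = np.array([780, 790, 815, 765, 800])

# Paired (repeated-measures) t-test across subjects, as in the follow-up comparisons
t_stat, p_value = stats.ttest_rel(rt_low_contrast, rt_high_contrast)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")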
Table 1

Reaction times (RT, in ms) and accuracy rates (ACC, in percent correct) of Experiment 1

               High contrast                                                  Low contrast
               Attend-visual                Attend-auditory                   Attend-visual                Attend-auditory
               IC        C         N        IC        C         N             IC        C         N        IC        C         N
RT
 Short delay   770 (34)  719 (26)  743 (29) 859 (36)  747 (26)  781 (27)      826 (36)  764 (31)  797 (35) 843 (32)  767 (28)  788 (26)
 Long delay    597 (17)  594 (18)  604 (17) 643 (20)  610 (18)  624 (20)      602 (19)  590 (18)  593 (17) 633 (20)  620 (18)  625 (19)
ACC
 Short delay   96 (1)    98 (1)    97 (1)   91 (1)    98 (1)    98 (1)        89 (2)    92 (1)    92 (1)   93 (1)    98 (1)    98 (1)
 Long delay    95 (1)    97 (~0)   97 (1)   94 (1)    96 (1)    97 (1)        87 (2)    91 (1)    89 (2)   94 (1)    96 (1)    97 (1)

IC incongruent, C congruent, N neutral. Each cell shows the condition mean with the standard error in parentheses

Next, a four-way mixed model ANOVA was conducted on the reaction times and accuracy with between-subject factor of Delay (short, long) and within-subject factors of Attended-Modality (visual, auditory), Congruity (congruent, incongruent, neutral) and Picture-Contrast (high, low).
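A sketch of the repeated-measures part of this analysis is shown below, using statsmodels’ AnovaRM on per-subject condition means. Because AnovaRM handles only within-subject factors, the sketch runs the Modality × Congruity × Contrast analysis separately within each Delay group; the full four-way mixed design and the Greenhouse–Geisser correction would require additional statistical software. Column names and data layout are our assumptions.

import pandas as pd
from statsmodels.stats.anova import AnovaRM

def rm_anova_per_delay(df):
    """Within-subject ANOVA (modality x congruity x contrast) on reaction times,
    run separately for each delay group.
    Expects columns: 'subject', 'delay', 'modality', 'congruity', 'contrast', 'rt'."""
    results = {}
    for delay, group in df.groupby("delay"):
        aov = AnovaRM(group, depvar="rt", subject="subject",
                      within=["modality", "congruity", "contrast"],
                      aggregate_func="mean").fit()
        results[delay] = aov.anova_table
    return results

# Usage (df would be built from the trimmed single-trial data):
# tables = rm_anova_per_delay(df)
# print(tables["short"])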

Reaction times

All main effects were significant (all P < 0.01). The average reaction times and accuracy rates are depicted in Table 1.

Modality effect

Subjects were overall faster to respond in the attend-visual task (683 ms) than in the attend-auditory task (712 ms), as confirmed by a main effect of Attended-Modality (F(1, 45) = 14.15, P < 0.001). However, a significant interaction of Attended-Modality × Picture-Contrast (F(1, 45) = 4.84, P < 0.01) revealed that this effect differed for low- vs. high-contrast pictures: while in the high-contrast condition responses in the attend-visual task were indeed faster than in the attend-auditory task (main effect of Modality: F(1, 45) = 33.85, P < 10−6), the difference was not significant in the low-contrast condition (F(1, 45) = 3.13, P = 0.08). The same pattern was obtained when the analysis was restricted to the neutral conditions. Planned comparisons confirmed that responses were faster in the neutral attend-visual than in the neutral attend-auditory condition when the contrast was high (two-tailed: t(46) = 3.8, P < 0.001) but not when it was low (t(46) = 1.11, P = 0.27). Thus, whereas the visual task seemed to be easier than the auditory task under normal viewing conditions, the contrast manipulation was successful in equating the difficulties. The accuracy data, inspected below, preclude a speed–accuracy tradeoff.

Congruity effect

Reaction times for incongruent trials (722 ms) were longer than for neutral trials (694 ms), which were in turn longer than for congruent trials (676 ms; main effect of Congruity: F(2, 90) = 68.12, P < 10−13, ε = 0.74). Compared to neutral trials, both the facilitation by congruent trials (F(1, 45) = 50.8, P < 10−8) and the interference by incongruent trials (F(1, 45) = 39.63, P < 10−6) were significant (supplemental figure S1). The main effect of Congruity was also significant for each Attended-Modality separately (separate three-way, Congruity × Picture-Contrast × Delay ANOVAs; attend-visual: F(2, 90) = 19.06, P < 10−6, ε = 0.93; attend-auditory: F(2, 90) = 46.83, P < 10−10, ε = 0.83) and for each contrast (separate two-way, Congruity × Attended-Modality ANOVAs; low contrast: F(2, 90) = 37.81, P < 10−11, ε = 0.998; high contrast: F(2, 90) = 45.97, P < 10−10, ε = 0.77). However, the size of the congruity effect was modulated by these factors (Fig. 2a, b; Picture-Contrast × Attended-Modality × Congruity interaction; F(2, 90) = 1.89, P < 0.01, ε = 0.94). To examine this three-way interaction we conducted two-way ANOVAs separately for the high and the low Picture-Contrast conditions. These revealed a significant Attended-Modality × Congruity interaction when the pictures were of high contrast (F(2, 90) = 15.55, P < 0.0001, ε = 0.81), but not when they were of low contrast (F(2, 90) < 1). The interaction in the high-contrast condition reflected a larger congruity effect in the attend-auditory condition than in the attend-visual condition (Fig. 2). Further analysis showed that this was due to more interference (incongruent–neutral; F(1, 45) = 28.72, P < 10−5) in the attend-auditory than in the attend-visual condition. In contrast, the amount of facilitation (congruent–neutral) was similar for the two conditions (F(1, 45) < 1, P = 0.39). In summary, irrelevant visual information interfered more with auditory recognition than vice versa, but only in the high visual contrast condition.
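As a simple illustration of how the interference and facilitation scores reported here can be derived from per-subject cell means, consider the following pandas sketch (column names and condition labels are our assumptions):

import pandas as pd

def congruity_effects(df):
    """Per-subject mean RT by congruity, plus interference (incongruent - neutral),
    facilitation (neutral - congruent) and the overall gain (incongruent - congruent).
    Assumes columns 'subject', 'congruity' and 'rt'."""
    cell_means = (df.groupby(["subject", "congruity"])["rt"]
                    .mean()
                    .unstack("congruity"))
    return pd.DataFrame({
        "interference": cell_means["incongruent"] - cell_means["neutral"],
        "facilitation": cell_means["neutral"] - cell_means["congruent"],
        "gain": cell_means["incongruent"] - cell_means["congruent"],
    })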
Fig. 2

Congruity effects. Reaction time and accuracy are affected by cross-modal congruity. All panels show the reaction time gain (incongruent–congruent) and accuracy gain (congruent–incongruent hit rates). Dark bars represent the attend-visual condition, light bars the attend-auditory conditions. Error bars reflect the standard error. a Performance on the short-delay condition of Experiment 1. b Performance on the long-delay condition of Experiment 1. c Performance on Experiment 2

Effects of Delay

The above results suggest that the congruity effect is asymmetrical between modalities, at least in the high-contrast condition. A trivial reason for such an asymmetry could be that the auditory stimuli take time to evolve, whereas the visual stimuli are practically instantaneous. Although the auditory stimuli were highly recognizable and repeated a few times, some minimal amount of auditory information must nevertheless be presented before recognition can occur, and therefore the sounds might not have had a fair chance to affect visual recognition by the time the verification item appeared. To control for this possible confound, in the long-delay condition we added a delay between the stimulus presentation and the response, enabling the auditory stimulus to be fully presented and processed and thus minimizing any inherent advantage of the visual stimulation. We predicted that if the asymmetry resulted from these trivial differences between the time courses of auditory and visual stimuli, it should be reduced or eliminated with the longer delay between the offset of the stimuli and the onset of the verification item.

The effect of delay on the modality asymmetry can be estimated from the four-way interaction of Delay × Picture-Contrast × Attended-Modality × Congruity. Since this interaction was not significant (P = 0.94), there is no evidence for an effect of processing duration on the observed asymmetry between modalities. No effect of delay is seen even when the analysis is restricted to the high-contrast condition, where the asymmetry was maximal (the three-way interaction of Attended-Modality × Congruity × Delay, restricted to the high-contrast condition, is non-significant; F(1, 45) = 5, P = 0.17).

The four-way ANOVA described above (Delay × Attended-Modality × Congruity × Picture-Contrast) revealed a main effect of Delay (F(1, 45) = 28.47, P < 0.00001), caused by significantly longer reaction times in the short-delay condition (784 ms on average) compared to the long-delay condition (611 ms). An interaction of Delay × Congruity (F(2, 90) = 30.9, P < 10−7) was also seen, reflecting the fact that although the congruity effect was significant and in the same direction in both conditions (Fig. 2a, b; short delay: F(2, 40) = 48.89, P < 10−7, ε = 0.7; long delay: F(2, 50) = 10.8, P < 0.001), it was smaller in the long-delay condition.

To summarize, increasing the delay between the stimuli and the verification item speeded the responses and reduced somewhat the congruity effect, but did not change the finding that the congruity effect was larger in the attend-auditory condition than in the attend-visual condition.

Accuracy rates

The four-way mixed model ANOVA revealed main effects of Attended-Modality (F(1, 45) = 20.68, P < 0.0001), Picture-Contrast (F(1, 45) = 56.5, P < 10−8) and Congruity (F(2, 90) = 42.82, P < 10−14, ε = 0.9), but unlike its effect on RT, Delay had no main effect on accuracy (F(1, 45) < 1). As in the RT analysis, the four-way interaction was not significant (F(2, 90) = 1.17, P = 0.31).

The three-way interaction of Attended-Modality, Picture-Contrast and Congruity was close to significant (F(2, 90) = 3.21, P = 0.054, ε = 0.85). A follow-up two-way ANOVA on the high-contrast condition alone revealed a significant Attended-Modality × Congruity interaction (F(2, 90) = 7.68, P < 0.01, ε = 0.91), stemming from a higher congruity effect in the attend-auditory than in the attend-visual condition (Fig. 2). However, when the pictures were of low contrast there was no such interaction (F(2, 90) = 1.13, P = 0.34).

Further analysis of the high-contrast condition suggested that the interaction was the result of a significant difference across attended modalities in the amount of interference (F(1, 45) = 44.4, P < 10−7), whereas there was no difference in facilitation (F(1, 45) < 1, P = 0.46).

In summary, just as for RTs, incongruent visual information interfered with auditory recognition more than vice versa, when the visual contrast was high. When visual recognition was impeded by reducing the contrast, this interaction was abolished.

Experiment 2

In Experiment 1, the verification item which followed the stimuli was presented visually, requiring subjects to read it. Consequently, Experiment 1 left open the possibility that the observed inter-modal asymmetry was due to subjects paying more attention to the visual than to the auditory modality, even when the task was auditory. This would explain why irrelevant visual information interfered more with auditory recognition than vice versa. Experiment 2 was designed to test this conjecture by using spoken words rather than printed words as the verification items.

Materials and methods

Subjects

Eighteen subjects participated in the experiment. The data of two subjects were rejected from the analysis because of technical problems during the experiment. The remaining subjects (9 females, 12 right-handed, ages ranging from 18 to 26 with a mean of 23 and a standard deviation of 2.1) had characteristics similar to those of the subjects of Experiment 1. The study was conducted in accordance with the ethical regulations of the Hebrew University of Jerusalem. All subjects signed an informed consent form.

The stimuli and procedures replicated the short-delay condition of Experiment 1, except that the verification item appearing after the stimuli was a pre-recorded spoken word (the name of an animal) rather than a printed word. The same analysis was conducted, except that there was no Delay factor, as only the short delay was used in this experiment.

Results

The main results of Experiment 2 replicated those of Experiment 1 (Table 2; Fig. 2). A three-way ANOVA with factors Congruity, Modality and Contrast was conducted.
Table 2

Reaction times (RT, in ms) and accuracy rates (ACC, in percent correct) of Experiment 2

       High contrast                                                  Low contrast
       Attend-visual                Attend-auditory                   Attend-visual                Attend-auditory
       IC        C         N        IC        C         N             IC        C         N        IC        C         N
RT     672 (24)  635 (27)  664 (25) 836 (43)  732 (39)  769 (43)      740 (21)  699 (23)  720 (25) 808 (42)  764 (44)  800 (44)
ACC    96 (1)    98 (1)    98 (1)   93 (1)    98 (1)    96 (1)        88 (2)    92 (1)    93 (1)   93 (1)    98 (1)    96 (1)

IC incongruent, C congruent, N neutral. Each cell shows the condition mean with the standard error in parentheses

The effect of contrast on visual recognition

Examining the neutral attend-visual trials revealed that reaction times were shorter and accuracy was higher for the high-contrast compared with the low-contrast condition (RT: t(15) = 6.23, P < 10−4; accuracy: t(15) = 5.6, P < 10−4). Thus, the contrast manipulation affected the performance of the visual task.

Modality effect

Subjects were faster in the attend-visual than in the attend-auditory condition (F(1, 15) = 16.12, P < 0.01), but this modality effect was significantly larger in the high- than in the low-contrast condition (Attended-Modality × Picture-Contrast: F(1, 15) = 19.74, P < 0.001). Unlike in the first experiment, the modality effect was significant for both contrasts when examined separately (two-way ANOVAs; high contrast: F(1, 15) = 25.55, P < 0.001; low contrast: F(1, 15) = 7.9, P < 0.05). The main effect of Modality on accuracy (in the three-way ANOVA) was not significant (F(1, 15) = 3.68, P = 0.074), but as for RTs, the interaction of Attended-Modality and Picture-Contrast was (F(1, 15) = 37.45, P < 10−4). Examining the two contrasts separately revealed that, similar to the RT results, accuracy rates were higher for the attend-visual than for the attend-auditory condition in the high-contrast condition (F(1, 15) = 6.72, P < 0.05), whereas in the low-contrast condition a significant opposite effect was found (accuracy was lower in the attend-visual than in the attend-auditory condition; F(1, 15) = 7.9, P < 0.05).

Congruity effect

Congruity affected the speed of performance for both attended modalities (showing both interference and facilitation; Fig. 2c and supplemental figure) but the attended-modality and the contrast modulated this congruity effect (Picture-Contrast × Attended-Modality × Congruity interaction; F(2, 30) = 5.93, P < 0.01, ε = 0.98). Further analysis of the interaction showed, as in Experiment 1, that in the high-contrast condition, visual information interfered more with auditory recognition than vice versa (incongruent–neutral; F(1, 15) = 9.82, P < 0.01). In contrast, the effect was symmetrical in the low-contrast condition (no Modality × Condition interaction; F(1, 15) = 1.75, P = 0.2).

Although the same trend of congruity effects seen in the RT results was also present in the accuracy rates (Fig. 2), the accuracy rates showed no significant interaction of Attended-Modality, Contrast and Congruity (F(2, 30) = 1.17, P = 0.32, ε = 0.81). The Attended-Modality × Condition interactions were not significant in either the low- or the high-contrast condition when examined separately (low: F(2, 30) = 1.75, P = 0.2, ε = 0.85; high: F(2, 30) = 0.47, P = 0.63, ε = 0.79). The difference between the RTs and the accuracy rates might result from a ceiling effect on the accuracy rates, which were very high in all conditions (Table 2).

General discussion

The results of these experiments can be summarized as follows: information from the unattended-modality, be it visual or auditory, can either facilitate (if it is congruent) or interfere (if it is incongruent) with object recognition in the other modality. Under normal, undisturbed conditions, unattended incongruent visual information interferes more with auditory recognition than vice versa. This suggests preferred reliance on visual information in object recognition. This asymmetry can be eliminated however if the quality of visual information is reduced, rendering the difficulty of auditory and visual recognition closer (Fig. 2). Thus, visual dominance in object recognition seems to be a matter of ‘opportunistic’ use of the most available information, rather than a fixed dominance of one modality.

Cross-modal interactions requiring higher level processing were previously found in Stroop-like paradigms creating incongruity between spoken words and visual stimuli such as pictures or letters (see “Introduction”). Although this suggests interactions at the verbal or semantic level, the conclusions from these studies are limited by the fact that the unattended information was always a word. Thus, these studies did not clarify to what degree unattended non-verbal information about objects is automatically processed at the semantic level1 and whether it can influence recognition in another modality. This gap is filled by the present demonstration of a Stroop-like effect based on semantic congruity between two non-verbal stimuli. The fact that we used objects that are natural and ecologically valid in their congruent form may have strengthened the effect. A recent study (von Kriegstein and Giraud 2006) showed that ecologically valid cross-modal pairs (e.g. faces and voices) create strong associations that facilitate functional brain connectivity and object recognition, and that are not found with unnatural pairs (e.g. phones and ring tones).

Cross-modal congruity effects might occur at different stages of processing: early processing stage, perceptual representation stage, abstract post-perceptual representation stage (semantic or verbal) or decision making stage (Marks 2004). Congruity at the early stages may be based on “functional equivalence” between the senses. For example, the timing, spatial location, or movement direction of an object may be directly evaluated within both modalities. In addition, several features seem to have some perceptual equivalence. For example, pitch and visual brightness interact (Marks 2004). However, none of these low-level features plays a role in the present experiment. In our experiment, the congruity was based on high-level semantic information conveyed by the image and the sound, and therefore the most likely level at which the effects occurred is the semantic level. Also, although our stimuli were objects and not words, a post-recognition lexical level cannot be ruled out. Some subjects reported adopting a strategy of mentally naming the target animal during the time between the stimulus presentation and the response. This could have been affected by congruity if not only the target name but also the irrelevant animal’s name was activated.

In a previous study (Molholm et al. 2004), a cross-modal congruity effect was found for targets, predefined as a specific animal (e.g. cow) regardless of whether it consisted of the auditory or the visual part of the multisensory stimuli, but not for non-targets. In the present paradigm, all stimuli can be considered targets in the sense that they all required a response. This target effect may indicate that early filtering, perhaps based on low-level features, may abort processing prior to multisensory integration stages. In another study, using an “old–new” visual memory task (Lehmann and Murray 2005), congruity did not significantly affect the subjects’ performance on their first encounter with a multisensory stimulus (i.e. their ability to say “new”). In that study, as in the attend-visual parts of our experiments, subjects’ attention was directed to the visual modality. Thus, the task may be an important determinant of cross-modal effects. Specifically, our task may have emphasized more explicit (semantic) recognition, whereas the old–new task emphasized familiarity (or, for the first occurrence, novelty). These task effects will require further exploration.

The interference between modalities was present whether the subjects attended the visual information and ignored the auditory or vice versa. However, under full visual contrast it was stronger when subjects attended the sounds than when they attended the images. A similar visual advantage was previously found in speeded classification of visual and auditory features (Ben-Artzi and Marks 1995; Marks et al. 2003), as well as in Stroop-like paradigms with verbal stimuli (Larsen et al. 2003; Lewis 1972; Sen and Posner 1979). Sen and Posner even found the asymmetry to be specific to interference (rather than facilitation), as found in the present study. Also compatible with a visual advantage in recognition is the finding that visual events prime both visual and auditory events, whereas auditory events (e.g. a baby crying) prime only auditory events (Greene et al. 2001).

The asymmetry found using the full contrast images replicates our previous findings in a study that addressed mainly the neural correlates of cross-modal congruity (Yuval-Greenberg and Deouell 2007). However, that study left open several alternative interpretations regarding the basis of the cross-modal imbalance, which are addressed in the present study. The enhanced reliance on unattended (and disruptive) visual information in our previous study, as in the present Experiment 1, could have been explained by the fact that the verification item was in the form of a printed word, that is, presented in the visual modality. This could direct some of the subject’s attention to the visual domain even in the case of the attend-auditory condition. However, this was ruled out in Experiment 2 in which the visual advantage was replicated despite the fact that the verification word was now spoken, i.e. presented in the auditory modality. A second, trivial reason for visual advantage could be the fact that visual information appears instantaneously on the retina, while complex sounds such as animal vocalizations must unfold in time before enough information is gathered for recognition. It could be argued that in the short-delay condition of Experiment 1, as well as in Yuval-Greenberg and Deouell (2007), at least some of the vocalizations were not completely processed by the time the verification item was presented, 500 ms from stimulus onset. This could have reduced the effect of unattended sounds. However, insufficient processing time is unlikely to be the main reason for the visual–auditory imbalance, as in the long-delay condition of the present Experiment 1 a much longer delay (1,500 ms from onset, or 1,000 ms from stimulus offset) was introduced, yet the imbalance remained unaffected. It is implausible that by this time the auditory stimuli were not completely processed. Finally, it could be argued that sub-vocal rehearsal of the name of the attended visual stimulus could have interfered with processing of the unattended auditory stimulus, thus reducing its interference. This, however, is unlikely. Considering the time needed for the processes of visual recognition, lexical selection, and phonological (including premotor; see Bullmore et al. 2000; Huang et al. 2008) programming, which are required for sub-vocal rehearsal, there was hardly time to start actual sub-vocal rehearsing within the 500 ms between the onset and offset of the stimulus. For comparison, actually naming aloud the high-contrast visual stimuli in the pilot study with 10 subjects started on average 821 ms (SD = 113) after stimulus onset.

The finding that visual stimuli influence auditory processing more than vice versa could be interpreted therefore as reflecting visual dominance in object recognition. By the “modality appropriateness” model of cross-modal integration (Welch and Warren 1980), the dominant modality in cases of conflict is the modality better equipped for processing a specific stimulus dimension, and the visual system may be the “appropriate” modality for object recognition. Indeed, studies on visual–tactile interactions show that the visual modality is, in most cases, the preferable source of information on objects’ identities (Rock and Victor 1964). In the extreme form of this account, the interaction between two modalities is inherently biased, perhaps even embedded in brain circuitry.

Recent findings, however, suggest that the direction of cross-modal effects is not deterministic (Witten and Knudsen 2005). Rather, it is a function of the reliability of information conveyed by each modality. For example, auditory information does not add much to the localization of audio–visual stimuli when the visual input is optimal, but it improves localization when the visual information is sub-optimal (Hairston et al. 2003a, b; Heron et al. 2004). Comparable results were found for visual-somatosensory integration (Ernst and Banks 2002; Ernst et al. 2000). According to this view, the brain integrates information in an optimal way, i.e. based on the particular circumstances, and thus influence from the nominally inferior modality would be evident when the information in the usually dominant modality is deficient. Our findings extend this view to the case of visual–auditory cross-modal object recognition. The prototypical, undisturbed visual image of a familiar animal may have provided information which was more distinctive and which activated more robust stored representations than the auditory vocalization, biasing the decision in favor of the visual information. Nevertheless, the elimination of the cross-modal asymmetry in our low-contrast condition shows that the reliance on the visual information is not mandatory or hard wired.

Lowering the contrast of the visual stimuli prolongs visual processing. This has been shown in the cat and the monkey using single-cell recordings (Albrecht et al. 2002; Reich et al. 2001) as well as in human MEG (Hall et al. 2005) and EEG (Calvert et al. 2005; Porciatti et al. 2000). Accordingly, lowering the contrast of our stimuli reduced the visual speed advantage in the neutral conditions. However, slowing of visual processing cannot fully explain the elimination of the modality asymmetry in the low-contrast condition. Had the visual speed advantage been the main determinant of the basic modality asymmetry observed with high contrast, we would expect that increasing the stimulus–response delay would reduce the asymmetry by providing extra processing time. This, however, was not the case: the sensitivity of auditory recognition to cross-modal incongruity remained higher than that of visual recognition in the long-delay condition with high-contrast visual stimuli. Lowering the contrast not only slows processing, but also inherently reduces the information available to the visual system. When contrast is lowered, the pertinent data in the image are compressed into a smaller fraction of the available luminance range, so that nearby pixels which previously had clearly separable luminance levels become much more similar, reducing the amount of available information. In particular, information about contours and texture, which is essential for object recognition, is severely degraded. Moreover, low-contrast stimuli are much weaker activators of the parvocellular system, a central player in the ventral visual stream (the so-called ‘what’ system), than high-contrast stimuli. The lack of information in the low-contrast stimuli was evident in the lower performance level for these pictures in the attend-visual condition. Not only were the reaction times prolonged relative to the equivalent high-contrast condition, but accuracy was also significantly reduced, making it unlikely that the contrast effect was restricted to the speed of low-level visual processing (Näsänen et al. 2006; Porciatti et al. 2000; Reich et al. 2001). Thus, in the low-contrast condition, the baseline difficulty of visual and auditory recognition became closer, and consequently the cross-modal asymmetry was eliminated. We conclude that although the visual modality is in many circumstances the more influential modality in object recognition, its dominance depends on the quality of the information it provides. The resolution of conflict is affected by the availability of information in the two modalities, as predicted by the optimal integration model.

Footnotes
1

In “semantic representation”, we include here also access to the object’s name.

 

Supplementary material

221_2008_1664_MOESM1_ESM.pdf (109 kb)
Supplemental Fig. 1: Facilitation and interference compared to neutral. For reaction time, the congruent (C, crossed bars) or incongruent (IC, dotted bars) RTs were subtracted from the neutral RT. For accuracy, the neutral hit rate was subtracted from the congruent or incongruent hit rate. Thus, facilitation is upwards, interference downwards for both measures. Dark bars attend-visual, light bars attend-auditory. Error bars reflect the standard error. a Performance on the short-delay condition of Experiment 1. b Performance on the long delay condition of Experiment 1. c Performance on Experiment 2 (PDF 108 kb)

Copyright information

© Springer-Verlag 2008