The effect of spatial–temporal audiovisual disparities on saccades in a complex scene
- Cite this article as:
- Van Wanrooij, M.M., Bell, A.H., Munoz, D.P. et al. Exp Brain Res (2009) 198: 425. doi:10.1007/s00221-009-1815-4
In a previous study we quantified the effect of multisensory integration on the latency and accuracy of saccadic eye movements toward spatially aligned audiovisual (AV) stimuli within a rich AV-background (Corneil et al. in J Neurophysiol 88:438–454, 2002). In those experiments both stimulus modalities belonged to the same object, and subjects were instructed to foveate that source, irrespective of modality. Under natural conditions, however, subjects have no prior knowledge as to whether visual and auditory events originated from the same, or from different objects in space and time. In the present experiments we included these possibilities by introducing various spatial and temporal disparities between the visual and auditory events within the AV-background. Subjects had to orient fast and accurately to the visual target, thereby ignoring the auditory distractor. We show that this task belies a dichotomy, as it was quite difficult to produce fast responses (<250 ms) that were not aurally driven. Subjects therefore made many erroneous saccades. Interestingly, for the spatially aligned events the inability to ignore auditory stimuli produced shorter reaction times, but also more accurate responses than for the unisensory target conditions. These findings, which demonstrate effective multisensory integration, are similar to the previous study, and the same multisensory integration rules are applied (Corneil et al. in J Neurophysiol 88:438–454, 2002). In contrast, with increasing spatial disparity, integration gradually broke down, as the subjects’ responses became bistable: saccades were directed either to the auditory (fast responses), or to the visual stimulus (late responses). Interestingly, also in this case responses were faster and more accurate than to the respective unisensory stimuli.
Keywords: Multisensory integration · Human · Gaze control · Race model · Natural scene
Saccadic eye movements reorient the fovea fast and accurately to a peripheral target of interest. Much of the neurophysiological mechanisms underlying saccades (Findlay and Walker 1999; Munoz et al. 2000, for review) have been revealed by studies carried out under simplified conditions, in which a single visual target evokes a saccade in an otherwise dark and silent laboratory room.
However, under more natural conditions, potential targets may be masked by a noisy audiovisual (AV) background. The brain should then segregate these targets from the background, weed out the irrelevant distractors, determine the target coordinates in the appropriate reference frame, and prepare and initiate the saccade. This is a highly nontrivial task, and it is thought that efficient integration of multisensory inputs could optimize neural processing time and response accuracy (Stein and Meredith 1993; Anastasio et al. 2000; Calvert et al. 2004; Colonius and Diederich 2004a; Binda et al. 2007).
Indeed, many studies have shown that combined auditory and visual stimuli lead to a significant reduction of saccade reaction times (SRTs; Frens et al. 1995; Hughes et al. 1998; Colonius and Diederich 2004b). Theoretical analyses have shown that this reduction cannot be explained by mere statistical facilitation, an idea that is formalized by the so-called ‘race model’ (Raab 1962; Gielen et al. 1983). This principle holds that sensory inputs are engaged in a race, whereby the saccade is triggered by the sensory event that first crosses a threshold. This benchmark model predicts that, in the absence of any bimodal integration, the expected distribution of minimum reaction times shifts toward shorter latencies than those for the unimodal responses. Saccades elicited by simple AV stimuli show a general reduction of the SRT, in combination with a systematic modulation by the spatial–temporal stimulus separation (Frens et al. 1995; Hughes et al. 1998). Whereas the former effect may be attributed to statistical facilitation or to a nonspecific warning effect, the spatial–temporal modulation cannot be accounted for by the race model, and is a clear indication of AV integration. Spatial–temporal effects may be understood from neural interactions within a topographically organized multisensory representation. For that reason, the midbrain Superior Colliculus (SC) has been considered as a prime candidate for multisensory integration (Stein and Meredith 1993, for review; see also Anastasio et al. 2000, for theoretical accounts). Electrophysiological studies in the intermediate and deep layers of the SC have indicated that similar spatial–temporal interactions are found in the sensory and motor responses of saccade-related neurons (Meredith and Stein 1986; Meredith et al. 1987; Peck 1996, in cat; Frens and Van Opstal 1998; Bell et al. 2005, in monkey).
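The race-model benchmark described above can be sketched numerically: statistical facilitation alone predicts that the trial-wise minimum of two independent unimodal reaction times is faster, on average, than either unimodal distribution. The Gaussian latency parameters below are purely illustrative, not fitted to any data in this study.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical unimodal saccade reaction times (ms);
# parameters are illustrative only.
rt_a = rng.normal(220, 30, 10_000)   # auditory: fast, inaccurate
rt_v = rng.normal(280, 40, 10_000)   # visual: slow, accurate

# Race model (Raab 1962): the saccade is triggered by whichever
# sensory channel crosses threshold first, i.e. the trial-wise
# minimum of two independent unimodal samples.
rt_race = np.minimum(rt_a, rt_v)

# Statistical facilitation: the predicted bimodal mean is shorter
# than either unimodal mean, without any neural integration.
print(rt_a.mean(), rt_v.mean(), rt_race.mean())
```

Observed AV latencies that fall below this race prediction, or that are modulated by spatial–temporal stimulus geometry, cannot be explained by statistical facilitation and thus point to genuine integration.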
Yet, the majority of AV integration studies have typically been confined to the situation of one visual target, combined with one auditory stimulus (the latter often a distractor: Frens and Van Opstal 1995; Corneil and Munoz 1996; Harrington and Peck 1998; Colonius and Diederich 2004b). Few studies have quantified the effects of multisensory integration in more complex environments. In a recent study we investigated saccades to visual (V), auditory (A), and AV-targets in the two-dimensional frontal hemifield within a rich AV-background that contained many visual distractors and spatially diffuse auditory white noise (Corneil et al. 2002). The target could be a dim red LED, or a broadband buzzer. We systematically varied the signal-to-noise ratio (SNR) of the target sound versus background noise, to assess unisensory sound-localization behavior, and constructed spatially aligned AV-targets from the four different SNRs and three onset asynchronies (12 different AV target types). In such a rich environment, the first V-saccade responses in a trial typically had long reaction times, and were often in the wrong direction. A-saccades were typically faster than V-saccades, where the SNR primarily affected accuracy in stimulus elevation and saccade reaction time. Interestingly, all AV stimuli manifested AV integration that could best be described by a “best of both worlds” principle: auditory speed at visual accuracy (Corneil et al. 2002, Fig. 10).
Note, that the subject’s task in these experiments was unambiguous: make a fast and accurate saccade to the target that appears as soon as the fixation light is extinguished. Yet, in more natural situations one cannot assume in advance that given visual and auditory events arose from the same object in space. In particular, as sound-localization is often less accurate than vision, perceived stimulus locations need not be aligned either. This was also the case in the experiments of Corneil et al. (2002), especially for the low SNR’s. However, the effect of perceived spatial misalignment (Steenken et al. 2008) was not investigated in that study.
Here, we describe AV integration for the situation that the subject has no advance knowledge about the spatial configuration of the stimuli. We thus extended the paradigm of Corneil et al. (2002) by introducing a range of spatial disparities between auditory and visual stimuli, and instructed the subject to localize the visual stimulus fast and accurately, and to ignore the auditory distractor. We varied the spatial and temporal disparities of AV stimuli, as well as the SNR of the auditory distractor against the background noise.
Although in the current experiments the auditory stimulus did not provide a consistent spatial cue for the visual target, we found that the saccadic system still efficiently used acoustic information to generate faster and more accurate responses for spatially aligned stimuli (presented in only 16% of the trials). We also obtained a consistent relation of the subject’s error rate with SRT for all (aligned and disparate) stimuli: for short SRTs, saccades were acoustically guided, thus often ending at a wrong location. Late saccades were typically visually guided. For intermediate SRTs to spatially disparate stimuli, responses could be either auditorily or visually guided, but responses were still faster and more accurate than in the unisensory conditions. Similar bistable behavior has been reported for auditory and visual double-stimulation experiments engaged in target/non-target discrimination tasks (e.g., Ottes et al. 1985; Corneil and Munoz 1996). A theoretical account for our results is discussed.
Five subjects, aged 24–44 (mean 30.2 years) participated in this study after having given their informed consent. All procedures were in accordance with the local ethics committee of the Radboud University Nijmegen. Three subjects (A. John Van Opstal, JO; Andrew H. Bell, AB; and Marc M. Van Wanrooij, MW) are authors of this article; the remaining two (JG and JV) were naïve about the purpose of the study. Subjects JO and MW also participated in a similar previous study (Corneil et al. 2002). All subjects reported normal hearing and, with corrective glasses or lenses worn in the experimental setup (JG and JV), had normal binocular vision, except for JO, who is amblyopic in his right (recorded) eye. The eye signal calibration procedure (see below) was corrected for any nonlinearity that may have been present in this subject’s data.
A detailed description of the experimental setup can be found in Corneil et al. (2002). Briefly, experiments took place in a completely dark and sound-attenuated room, in which echoes above 500 Hz were effectively attenuated and the overall background sound level was about 30 dB, A-weighted (dBA). Subjects were seated facing a rich stimulus array with their head supported by an adjustable neck rest. Horizontal and vertical eye movements were recorded using the scleral search coil technique (Robinson 1963; Collewijn et al. 1975), sampled at 500 Hz/channel.
The auditory background was generated by a circular array of nine speakers (Nellcor), mounted onto the wire frame at about 45° eccentricity (Fig. 1a). Sound intensities were measured at the position of the subject’s head with a calibrated sound amplifier and microphone (Brüel & Kjaer BK2610/BK4144, Norcross, GA), and are expressed in dBA. The auditory background consisted of broadband Gaussian white noise (0.2–20 kHz) at a fixed intensity of 60 dBA. The auditory distractor stimulus was produced by a broadband lightweight speaker (Philips AD-44725, Eindhoven, the Netherlands) mounted on a two-link robot, which allowed the speaker to be positioned in any direction at a distance of 90 cm (Hofman and Van Opstal 1998). The auditory distractor stimulus consisted of a periodic broad-band noise (period 20 ms, sounding like a 50 Hz buzzer) that had a flat broad-band characteristic between 0.2 and 20 kHz, presented at a variable intensity (see below).
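A periodic broadband noise of the kind described here can be sketched by tiling a single 20-ms segment of Gaussian noise: the repetition imposes a 50 Hz fundamental (the "buzz") while the within-period spectrum remains flat and broadband. The sample rate and duration below are illustrative assumptions, not the values used in the setup.

```python
import numpy as np

fs = 48_000                       # sample rate in Hz (illustrative)
period = int(0.020 * fs)          # one 20-ms noise period -> 50 Hz buzz

rng = np.random.default_rng(0)
segment = rng.standard_normal(period)   # flat broadband spectrum

# Repeating the identical segment every 20 ms makes the stimulus
# periodic (buzzer-like) without narrowing its spectral envelope.
buzzer = np.tile(segment, 25)     # 25 periods = 500 ms of stimulus
```

Because every period is an exact copy of the first, the waveform is strictly periodic at 50 Hz, which is what gives the distractor its pitch-like quality.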
Subjects completed three different paradigms: a visual calibration paradigm, an auditory localization paradigm, and the AV distractor paradigm that constituted the bulk of the experimental data. Every session began with the visual calibration paradigm followed by 2–4 blocks of the auditory localization and/or AV distractor paradigms.
Subjects were required to generate saccades to visual stimuli pseudo-randomly presented to 1 of 60–72 possible target locations (12 directions, 5–6 different eccentricities between 5 and 35°) in the absence of the AV-background. Each trial began by turning the central LED red (fixation point) for 800 ms. When it extinguished, a peripheral red target LED was illuminated which the subject had to refixate. Each target location was presented once. Similar to Corneil et al. (2002), the final saccadic endpoint was used for calibration purposes, whereas the endpoint of the first saccade was used for the visual-only data (VNOBG, without background).
Subjects generated saccades to auditory targets in the presence and absence of the AV-background (A and ANOBG, respectively). These data served to assess sound-localization performance under different SNR conditions. Each trial began with fixation of the central visual fixation point for 600–850 ms. Then, an auditory target was presented from 1 out of 25 possible locations within the oculomotor field. Auditory targets were presented at four different SNRs relative to the acoustic background (−6, −12, −18, −21 dB). A- and ANOBG-trials were run in separate blocks, often within the same experimental session.
Audiovisual distractor paradigm
Subjects generated saccades amidst an AV-background to V- and AV-targets. Each trial began with the appearance of the AV-background (Fig. 1a). After a randomly selected delay of either 150, 275, or 400 ms, the central LED turned red, which the subject had to fixate for 600–850 ms. The fixation LED was then turned green, and after a 200 ms gap a peripheral red target LED was illuminated. Subjects had to generate a saccade quickly and accurately to the peripheral target LED. The location of the target was selected pseudo-randomly from 1 out of 12 possible locations (12 directions, R = 20, 27°; Fig. 1a).
Table 1 Stimulus types used in the experiments and number of responses
All data analysis was performed in MatLab 7.4 (The MathWorks, Inc.).
Response data were calibrated by training two three-layer neural networks with the back-propagation algorithm that mapped final eye positions onto the target positions of the visual calibration paradigm (Goossens and Van Opstal 1997). Eye-position data from the other paradigms were then processed using these networks, yielding an absolute accuracy <3% over the entire range. Saccades were automatically detected from calibrated data, based on velocity and acceleration criteria using a custom-made program. Onset and offset markings were visually checked by the experimenter, and adjusted if necessary.
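The velocity-based detection step can be sketched as a simple two-threshold state machine on the vectorial eye velocity. The threshold values and the omission of the acceleration criterion are assumptions for illustration; the authors' custom program used both velocity and acceleration criteria, followed by manual verification.

```python
import numpy as np

def detect_saccades(azimuth, elevation, fs=500, vel_on=75.0, vel_off=30.0):
    """Return (onset, offset) sample pairs from calibrated eye position.

    Hypothetical two-threshold scheme: a saccade starts when the
    vectorial eye velocity exceeds vel_on (deg/s) and ends when it
    drops below vel_off. Thresholds are illustrative values only.
    """
    vx = np.gradient(azimuth) * fs        # deg/s, 500 Hz sampling
    vy = np.gradient(elevation) * fs
    speed = np.hypot(vx, vy)              # vectorial eye speed

    saccades, onset = [], None
    for i, s in enumerate(speed):
        if onset is None and s > vel_on:
            onset = i                     # velocity crossed onset threshold
        elif onset is not None and s < vel_off:
            saccades.append((onset, i))   # velocity fell below offset threshold
            onset = None
    return saccades
```

A synthetic 20° step of eye position, for example, yields a single detected saccade spanning the high-velocity ramp.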
Modality index and perceptual disparity
In the analyses presented here, we pooled data across subjects, unless noted otherwise. Statistical significance of a difference between two distributions was assessed by the 1D or 2D KS-test, where we took P < 0.05 as the accepted level of significance. The analysis was based on a total of 8776 trials. Table 1 gives a detailed breakdown of trials per subject.
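The 1D distribution comparison can be reproduced with SciPy's two-sample Kolmogorov–Smirnov test, whose statistic is the maximum vertical distance between the two empirical CDFs. The sample distributions below are invented for illustration and do not reflect the study's data.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)

# Two hypothetical SRT samples (ms); shapes are illustrative only.
srt_bimodal = rng.normal(230, 25, 400)    # e.g. AV condition
srt_unimodal = rng.normal(270, 35, 400)   # e.g. V-only condition

# Two-sample KS test: statistic D = max |F1(t) - F2(t)|.
stat, p = ks_2samp(srt_bimodal, srt_unimodal)
significant = p < 0.05    # the criterion adopted in the study
```

The KS test is distribution-free, which makes it well suited to latency data whose shape (unimodal vs. bimodal) is itself the quantity of interest.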
We first quantify the basic properties of the V- and A-saccades in our experiments, as they are crucial for later comparisons with the AV-responses.
V- and A-saccades
An important difference between V- and A-saccades, which cannot be readily observed from the primary saccade responses, is the difference in localization percepts induced by the AV-background (Fig. 4f). Although it could take a few attempts/saccades, subjects eventually localized the V-target (red line). In contrast, the background noise introduced a large undershoot in azimuth and elevation also for the final A-saccades. This aspect is important for the AV-disparity experiment, since the stimulus disparity between A- and V-targets deviated from the perceptual disparity. We will return to this difference in a later section.
Spatially aligned AV stimuli
In only 16% of the AV trials were the auditory stimuli spatially coincident with the visual target. Subjects were asked to localize the visual target fast and accurately regardless of the auditory distractor. Here, we first analyze responses to these stimuli, to check whether AV interactions would still follow the same rules as in the Corneil et al. (2002) study.
In contrast, the A75V stimuli (auditory leading; Fig. 5c, d) produced bimodal SRT distributions at both SNRs, with longer SRTs than the fastest A-distribution. Interestingly, bimodal response distributions were not obtained in the Corneil et al. (2002) study (see also “Discussion”). Note that the stochastically independent race model (Eq. 6) is also violated for these stimuli (Fig. 5e), as it predicts a single-peaked, faster (or equally fast) SRT distribution for all AV stimuli (the response SRTs even fail to reach the lower bound of the race model of Eq. 5, not shown). Yet, the measured distribution does not coincide with the predicted bistable response distribution of Eq. 7 (e.g., Fig. 2b) either. Thus, we conclude that both AV stimulus types underwent multisensory integration.
In contrast, two response clusters might be expected for A75V stimuli, corresponding to bistable responses (Fig. 5c, d). We therefore performed a K-means clustering analysis (K = 2, based on SRT, response azimuth, elevation, eccentricity, and direction), which indeed divided the data into distinct distributions (labeled by blue squares and red circles; Fig. 7c, d) with relatively high silhouette-values (0.76 for A1275V and 0.73 for A1875V).
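The clustering step described above can be sketched with scikit-learn's K-means and silhouette score; well-separated clusters yield silhouette values near 1. The two-dimensional feature set (SRT and azimuth error) and the cluster parameters below are simplifying assumptions, since the actual analysis used five response features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Hypothetical bistable responses: a fast, inaccurate (auditory-like)
# cluster and a slow, accurate (visual-like) cluster.
fast = np.column_stack([rng.normal(200, 15, 150), rng.normal(12, 3, 150)])
slow = np.column_stack([rng.normal(320, 25, 150), rng.normal(2, 1, 150)])

# Standardize features so SRT (ms) and error (deg) weigh equally.
X = StandardScaler().fit_transform(np.vstack([fast, slow]))

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sil = silhouette_score(X, labels)   # near 1 for well-separated clusters
```

A silhouette value well above 0.5, as reported for the A75V data, indicates that a two-cluster description captures the response structure; forcing more clusters degrades the score.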
The separated clusters (black ellipses) can be readily compared to the straightforward bistable model, which would yield two AV-clusters coinciding with either unisensory V- and A-distribution (Fig. 2). For the V75A stimuli and also for larger numbers of clusters on the A75V stimuli, the silhouette-values quickly dropped to values <0.5, indicating that a larger number of clusters is not readily observed in the data.
Taking a coarser look at the A75V data (Fig. 7c, d), the blue cluster best resembles the A-distribution, while the red cluster resembles the V-distribution. Yet, some responses in both clusters have SRTs and errors that could have resulted from either cluster. A closer look at the data reveals a gradual improvement in localization error as reaction time progresses, rather than a sudden drop that would have resulted from a true bistable mode, in which subjects would have shifted from fast and inaccurate auditory, to slow but accurate visual responses. In fact, at any given SRT, AV-responses were more accurate than the unisensory responses, which further underlines the evidence for multisensory interaction.
Spatially disparate AV stimuli
For A1275V stimuli with a considerable angular disparity (here ΔΦ = ±90°; Fig. 8b), however, K-means cluster analysis produced two clear distributions that appeared to obey the principles of a bistable mechanism: either auditory (blue), or visual (red) responses.
Figure 8c, d summarizes our findings for all 24 AV stimulus conditions employed in this study. In 17/24 conditions the response data could be separated into two clusters (single-cluster conditions: V75A12, Δφ = 90 and ΔR = 1.5; V75A18, Δφ = 0 and Δφ = 180; A1275V, Δφ = 0; A1875V Δφ = 0 and Δφ = 90). Figure 8c normalizes the cluster with the longest SRT against the V-responses, whereas in Fig. 8d the cluster with the shortest SRT was normalized against A-saccades (−12 and −18 dB). If these responses followed the simple bistable model of Fig. 2, all points would scatter around the center of these plots. As data points lie predominantly in the lower-left quadrant, the interesting point of this analysis is that, for all stimulus conditions, responses were actually better (i.e., faster and more accurate) than pure V- and A-saccades. Hence, even for spatially unaligned stimuli, AV enhancement occurs and the simple bistable model should be rejected.
Figure 8e, f summarizes our analysis for all perceived disparities of the A1275V and V75A12 stimuli. A clear pattern emerges in this plot: only when perceived disparity is very small is MI close to zero (green-colored bins), indicative of multisensory integration. It rapidly splits into two clusters for larger perceived disparities, with invariably aurally guided responses (blue) for the short SRTs (<250 ms), and visually guided saccades for longer SRTs (red). Hence, these plots delineate a sharply defined spatial–temporal window of AV integration. Similar results were obtained for the A18 distractor (not shown).
We studied the responses of the human saccadic system when faced with a visual orienting task in a rich AV environment and a competing auditory distractor. Our experiments extend the findings from Corneil et al. (2002) who assessed AV integration when visual and auditory stimuli both served as a target, and were always spatially aligned. Under such conditions the system responded according to a “best of both worlds” principle: as A-only saccades are typically fast but inaccurate (Fig. 4), and V-saccades are accurate but slow (Fig. 3), the AV-responses were both fast and accurate. These experiments demonstrated a clear integration of AV channels, whereby the interaction strength depended on the SNR of the target sound and the temporal asynchrony of the stimuli.
In the present study spatially aligned AV-targets comprised only a minority of trials (16%), while in the large majority (>80%) the auditory accessory did not provide any consistent localization cue to the system. Such a condition is arguably a more natural situation, as in typical complex environments there is no a priori knowledge about whether given acoustic and visual events originated from the same object in space.
Our data indicate that the orienting task belied a dichotomy, which was quite hard for our subjects. This was especially clear for stimuli in which the distractor preceded the visual stimulus by 75 ms (A75V condition; Figs. 5c, d and 8). In this case, the auditory input arrives substantially earlier in the CNS (by about 130 ms) and as a consequence subjects were unable to ignore the auditory distractor at short SRTs (<250 ms), as responses then appeared to be triggered by the sound. This was true for both spatially aligned (Figs. 5, 7, and 8a) and -disparate stimuli (Fig. 8e) and led to bimodal SRT distributions. A similar result for large horizontal eye-head gaze shifts was reported by Corneil and Munoz (1996) when salient AV stimuli were presented at opposite locations (ΔΦ = 180°, ΔR = 80°) without an AV-background. However, the stimulus uncertainty in that study was limited, as target and distractor could occupy only two possible locations.
Consistent with our observations on bistability, Corneil et al. (2002) found no bimodal response distributions. Note that in their study the perceived stimulus disparity was small compared to the current study (data not shown, but mean ± SD: 3.3 ± 1.4 vs. 19.8 ± 15.3°, respectively). The present study indicates that a small perceived disparity (<10°) does not elicit bistable responses (e.g., Fig. 8e, f).
Note that the height of the first SRT peak reflected the SNR of the acoustic distractor (Fig. 5c, d), which underlines our conclusion that these responses were indeed aurally guided (Fig. 8a). Interestingly, however, for the relatively rare spatially aligned condition the SRT distributions for A75V stimuli differed from the predictions of both the race model (Fig. 5e) and the bistable model (Fig. 2b), in that later responses, triggered by the visual stimulus, still had latencies shorter than visual-only responses. Moreover, even though early responses were acoustically triggered, their accuracy was better than for A-only saccades (Fig. 7). Thus, similar multisensory integration mechanisms as described by Corneil et al. (2002) also appear to operate efficiently in a rich environment that contains much more uncertainty.
Also in spatially unaligned conditions early responses were acoustically triggered and, therefore, typically ended near the location of the distractor (Fig. 8). Later responses were guided toward the visual target (Fig. 8c–f). The data from those AV stimuli thus seem to follow the predictions of the bistable model (cf. Fig. 2) much better. However, the quantitative analysis of Fig. 8c, d indicates that even in the situation of large spatial disparities the system is not driven exclusively by one stimulus modality, as responses are clearly influenced by the other modality too. Hence, a weaker form of multisensory enhancement persists that allows these responses to still outperform the unisensory-evoked saccades.
Taken together, our data show that the saccadic system rapidly accounts for the spatial–temporal relations between an auditory and visual event, and uses this information efficiently to allow multisensory integration to occur, provided the perceived spatial disparity is small. For disparities exceeding approximately 10–15°, the stimuli are treated as arising from different objects in space (Kording et al. 2007; Sato et al. 2007), which results in a bistable response mode (Fig. 8e, f). Thus, when forced to respond rapidly to a specified target, the system is prone to frequent localization errors. However, even in that case multisensory integration occurs, as the putative stimuli evoked faster and more accurate responses than their unisensory counterparts.
We use the terms “unimodal” and “bimodal” in a statistical sense (single- and double-peaked distributions, respectively), without referring to the unisensory or multisensory origin of the response distributions.
We gratefully acknowledge the technical support of T van Dreumel and H Kleijnen. We thank R Aalbers and PM Hofman for crucial contributions to the software. We also thank prof. H Colonius for constructive comments on an earlier draft of this manuscript. Experiments were carried out in the Nijmegen Laboratory as part of the Human Frontiers Science Program (Research Grant RG-0174/1998-B; AJVO and DPM). This research was further supported by a VICI grant of the Dutch NWO/ALW (AJVO and MMVW grant nr. 805.05.003), the Canadian Institutes of Health Research (AHB and DPM), and the Radboud University Nijmegen (AJVO and MMVW).
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.