Abstract
Vision typically has better spatial accuracy and precision than audition and as a result often captures auditory spatial perception when visual and auditory cues are presented together. One determinant of visual capture is the amount of spatial disparity between auditory and visual cues: when disparity is small, visual capture is likely to occur, and when disparity is large, visual capture is unlikely. Previous experiments have used two methods to probe how visual capture varies with spatial disparity. First, congruence judgment assesses perceived unity between cues by having subjects report whether or not auditory and visual targets came from the same location. Second, auditory localization assesses the graded influence of vision on auditory spatial perception by having subjects point to the remembered location of an auditory target presented with a visual target. Previous research has shown that when both tasks are performed concurrently they produce similar measures of visual capture, but this may not hold when tasks are performed independently. Here, subjects alternated between tasks independently across three sessions. A Bayesian inference model of visual capture was used to estimate perceptual parameters for each session, which were compared across tasks. Results demonstrated that the range of audiovisual disparities over which visual capture was likely to occur was narrower in auditory localization than in congruence judgment, which the model indicates was caused by subjects adjusting their prior expectation that targets originated from the same location in a task-dependent manner.
References
Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19(6):716–723
Alais D, Burr D (2004) The ventriloquist effect results from near-optimal bimodal integration. Curr Biol 14(3):257–262
Battaglia PW, Jacobs RA, Aslin RN (2003) Bayesian integration of visual and auditory signals for spatial localization. J Opt Soc Am A 20(7):1391–1397
Beierholm UR, Quartz SR, Shams L (2009) Bayesian priors are encoded independently from likelihoods in human multisensory perception. J Vis 9(5):23
Bertelson P, Radeau M (1981) Cross-modal bias and perceptual fusion with auditory-visual spatial discordance. Percept Psychophys 29(6):578–584
Bizley JK, Shinn-Cunningham BG, Lee AKC (2012) Nothing is irrelevant in a noisy world: sensory illusions reveal obligatory within- and across-modality integration. J Neurosci 32(39):13402–13410
Dobreva MS, O’Neill WE, Paige GD (2011) Influence of aging on human sound localization. J Neurophysiol 105(5):2471–2486
Dobreva MS, O’Neill WE, Paige GD (2012) Influence of age, spatial memory, and ocular fixation on localization of auditory, visual, and bimodal targets by human subjects. Exp Brain Res 223(4):441–455
Godfroy M, Roumes C, Dauchy P (2003) Spatial variations of visual–auditory fusion areas. Perception 32(10):1233–1245
Hairston WD, Wallace MT, Vaughan JW, Stein BE, Norris JL, Schirillo JA (2003) Visual localization ability influences cross-modal bias. J Cogn Neurosci 15(1):9–20
Hartmann WM, Rakerd B, Gaalaas JB (1998) On the source-identification method. J Acoust Soc Am 104(6):3546–3557
Hooke R, Jeeves TA (1961) “Direct search” solution of numerical and statistical problems. J ACM 8(2):212–229
Howard IP, Templeton WB (1966) Human spatial orientation. Wiley, New York
Jack CE, Thurlow WR (1973) Effects of degree of visual association and angle of displacement on the “ventriloquism” effect. Percept Mot Skills 37(3):967–979
Körding KP, Beierholm U, Ma WJ, Quartz S, Tenenbaum JB, Shams L (2007) Causal inference in multisensory perception. PLoS One 2(9):e943
Legendre P (1998) Model II regression user’s guide. R edition, R Vignette
Lewald J, Guski R (2003) Cross-modal perceptual integration of spatially and temporally disparate auditory and visual stimuli. Brain Res Cogn Brain Res 16(3):468–478
Mishra J, Martínez A, Hillyard SA (2010) Effect of attention on early cortical processes associated with the sound-induced extra flash illusion. J Cogn Neurosci 22(8):1714–1729
Nagelkerke NJD (1991) A note on a general definition of the coefficient of determination. Biometrika 78(3):691–692
Odegaard B, Shams L (2016) The brain’s tendency to bind audiovisual signals is stable but not general. Psychol Sci 27(4):583–591
Odegaard B, Wozny DR, Shams L (2015) Biases in visual, auditory, and audiovisual perception of space. PLoS Comput Biol 11(12):1–23
Odegaard B, Wozny DR, Shams L (2016) The effects of selective and divided attention on sensory precision and integration. Neurosci Lett 614:24–28
Razavi B, O’Neill WE, Paige GD (2007) Auditory spatial perception dynamically realigns with changing eye position. J Neurosci 27(38):10249–10258
Recanzone GH, Makhamra SD, Guard DC (1998) Comparison of relative and absolute sound localization ability in humans. J Acoust Soc Am 103(2):1085–1097
Rohe T, Noppeney U (2015) Cortical hierarchies perform Bayesian causal inference in multisensory perception. PLoS Biol 13(2):1–18
Sato Y, Toyoizumi T, Aihara K (2007) Bayesian inference explains perception of unity and ventriloquism aftereffect: identification of common sources of audiovisual stimuli. Neural Comput 19(12):3335–3355
Slutsky DA, Recanzone GH (2001) Temporal and spatial dependency of the ventriloquism effect. Neuroreport 12(1):7–10
Soto-Faraco S, Alsius A (2007) Conscious access to the unisensory components of a cross-modal illusion. Neuroreport 18(4):347–350
Thurlow WR, Jack CE (1973) Certain determinants of the “ventriloquism effect”. Percept Mot Skills 36(3):1171–1184
van Atteveldt NM, Peterson BS, Schroeder CE (2013) Contextual control of audiovisual integration in low-level sensory cortices. Hum Brain Mapp 35(5):2394–2411
Van Wanrooij MM, Bremen P, Van Opstal AJ (2010) Acquired prior knowledge modulates audiovisual integration. Eur J Neurosci 31(10):1763–1771
Wallace MT, Roberson GE, Hairston WD, Stein BE, Vaughan JW, Schirillo JA (2004) Unifying multisensory signals across time and space. Exp Brain Res 158(2):252–258
Warren DH, Welch RB, McCarthy TJ (1981) The role of visual–auditory “compellingness” in the ventriloquism effect: implications for transitivity among the spatial senses. Percept Psychophys 30(6):557–564
Wei XX, Stocker AA (2015) A Bayesian observer model constrained by efficient coding can explain ’anti-Bayesian’ percepts. Nat Neurosci 18(10):1509–1517
Wozny DR, Shams L (2011) Computational characterization of visually induced auditory spatial adaptation. Front Integr Neurosci 5:75
Wozny DR, Beierholm UR, Shams L (2010) Probability matching as a computational strategy used in perception. PLoS Comput Biol 6(8):e1000871
Zwiers MP, Van Opstal AJ, Paige GD (2003) Plasticity in human sound localization induced by compressed spatial vision. Nat Neurosci 6(2):175–181
Acknowledgements
We thank Martin Gira and Robert Schor for their technical assistance, and we are immensely grateful we had the opportunity to receive David Knill’s assistance in developing the computational model before his untimely passing. Research was supported by NIDCD Grants P30 DC-05409 and T32 DC-009974-04 (Center for Navigation and Communication Sciences), NEI Grants P30-EY01319 and T32 EY-007125-25 (Center for Visual Science), and an endowment by the Schmitt Foundation.
Appendix: Bayesian inference modeling of auditory localization and congruence judgment tasks
This model simulates the perception of temporally synchronous but spatially disparate auditory and visual targets, and subsequent performance of two perceptual tasks (auditory localization and congruence judgment). The model described here is a modified version of previously published work (Körding et al. 2007; Wozny et al. 2010; Wozny and Shams 2011).
1.1 Perception of visual and auditory target locations
Target locations are constrained to a fixed elevation and a range of azimuths within the frontal field, denoted as \(S_\mathrm{V}\) and \(S_\mathrm{A}\) for visual and auditory targets, respectively. Previous research has demonstrated that spatial perception is inherently uncertain and subject to biases, so the first step of this model is to produce a probability distribution that represents the set of percepts that could be generated by a given target. The generated percepts for visual and auditory targets are denoted \(X_\mathrm{V}\) and \(X_\mathrm{A}\), respectively, and the probabilities of a percept occurring given a particular target are denoted \(p(X_\mathrm{V}|S_\mathrm{V})\) and \(p(X_\mathrm{A}|S_\mathrm{A})\). Research has also shown that perceived target locations tend to be normally distributed, although they are subject to idiosyncratic biases, a tendency to overestimate eccentricity for auditory targets, and reliability that decreases with eccentricity (Dobreva et al. 2012; Odegaard et al. 2015). To model these aspects of perception, we assume normal distributions with means that are scaled and offset with respect to target location, similar to Odegaard et al. (2015, 2016). The scaling parameters are denoted \(G_\mathrm{V}\) and \(G_\mathrm{A}\), which can model a pattern of overestimating (\(G > 1\)) or underestimating (\(G < 1\)) target eccentricity. Constant biases in target location are denoted \(\mu _\mathrm{V}\) and \(\mu _\mathrm{A}\). Localization uncertainty (inverse reliability) is represented by \(\sigma _\mathrm{V}\) and \(\sigma _\mathrm{A}\). This model also includes parameters that capture the increase in uncertainty as a function of eccentricity, given as \(G_{\sigma _\mathrm{V}}\) and \(G_{\sigma _\mathrm{A}}\). Putting these terms together in a normal distribution gives the equations:

\(p(X_\mathrm{V}|S_\mathrm{V}) = \mathcal{N}(X_\mathrm{V};\ G_\mathrm{V}S_\mathrm{V} + \mu _\mathrm{V},\ \sigma _\mathrm{V} + G_{\sigma _\mathrm{V}}|S_\mathrm{V}|)\)

\(p(X_\mathrm{A}|S_\mathrm{A}) = \mathcal{N}(X_\mathrm{A};\ G_\mathrm{A}S_\mathrm{A} + \mu _\mathrm{A},\ \sigma _\mathrm{A} + G_{\sigma _\mathrm{A}}|S_\mathrm{A}|)\)
Perception is simulated by sampling \(X_\mathrm{V}\) and \(X_\mathrm{A}\) from these distributions. For convenience in later equations, the standard deviation parameters are expressed as \(\sigma _{SV} = \sigma _\mathrm{V} + G_{\sigma _\mathrm{V}} |S_\mathrm{V}|\) and \(\sigma _{SA} = \sigma _\mathrm{A} + G_{\sigma _\mathrm{A}} |S_\mathrm{A}|\). In addition, we assume a prior expectation over target locations, \(\mathcal{N}(\mu _\mathrm{P}, \sigma _\mathrm{P})\), which constrains target locations to the frontal field.
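For concreteness, this sampling step can be sketched in Python (the published analyses used MATLAB; the function name and parameter values below are illustrative assumptions, not fitted values):

```python
import random

def sample_percept(S, G, mu, sigma, G_sigma):
    """Sample a perceived location X for a target at azimuth S (degrees).

    The mean is gain-scaled and offset (G * S + mu), and the standard
    deviation grows with target eccentricity (sigma + G_sigma * |S|).
    """
    return random.gauss(G * S + mu, sigma + G_sigma * abs(S))

# Illustrative trial: a visual target at 10 deg and an auditory target at 15 deg
X_V = sample_percept(S=10.0, G=1.0, mu=0.5, sigma=1.5, G_sigma=0.02)
X_A = sample_percept(S=15.0, G=1.1, mu=-1.0, sigma=6.0, G_sigma=0.05)
```

Auditory percepts are drawn with a larger baseline `sigma` than visual percepts, reflecting the lower spatial reliability of audition.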
1.1.1 Estimating the probability that targets originate from a common source
In order to combine auditory and visual information in a behaviorally advantageous manner, it is necessary to be able to estimate whether or not crossmodal signals originate from a common source. Note that this process does not necessitate conscious decision making, as it could be performed early in the crossmodal sensory pathway. A common method of representing the probability of a common source for two targets (denoted \(p(C=1|X_\mathrm{V},X_\mathrm{A})\)) is given by Bayes’ Theorem:

\(p(C=1|X_\mathrm{V},X_\mathrm{A}) = \frac{p(X_\mathrm{V},X_\mathrm{A}|C=1)\,p_\mathrm{common}}{p(X_\mathrm{V},X_\mathrm{A})}\)
\(p_\mathrm{common}\) is the prior expectation that targets originate from a common source (\(C=1\)). We assume that the number of targets in the room is limited to either 1 or 2 (\(C=1\) or \(C=2\)). Therefore, the law of total probability states that \(p(X_\mathrm{V},X_\mathrm{A})\) can be expressed as:

\(p(X_\mathrm{V},X_\mathrm{A}) = p(X_\mathrm{V},X_\mathrm{A}|C=1)\,p_\mathrm{common} + p(X_\mathrm{V},X_\mathrm{A}|C=2)\,(1-p_\mathrm{common})\)
Substituting into Eq. 6 gives:

\(p(C=1|X_\mathrm{V},X_\mathrm{A}) = \frac{p(X_\mathrm{V},X_\mathrm{A}|C=1)\,p_\mathrm{common}}{p(X_\mathrm{V},X_\mathrm{A}|C=1)\,p_\mathrm{common} + p(X_\mathrm{V},X_\mathrm{A}|C=2)\,(1-p_\mathrm{common})}\)
In order to obtain the conditional probabilities for target location, we integrate over the latent variable \(S_i\).
The analytic solutions to these integrals are (Körding et al. 2007):

\(p(X_\mathrm{V},X_\mathrm{A}|C=1) = \frac{1}{2\pi \sqrt{\sigma _{SV}^2\sigma _{SA}^2 + \sigma _{SV}^2\sigma _\mathrm{P}^2 + \sigma _{SA}^2\sigma _\mathrm{P}^2}} \exp \left[ -\frac{1}{2}\,\frac{(X_\mathrm{V}-X_\mathrm{A})^2\sigma _\mathrm{P}^2 + (X_\mathrm{V}-\mu _\mathrm{P})^2\sigma _{SA}^2 + (X_\mathrm{A}-\mu _\mathrm{P})^2\sigma _{SV}^2}{\sigma _{SV}^2\sigma _{SA}^2 + \sigma _{SV}^2\sigma _\mathrm{P}^2 + \sigma _{SA}^2\sigma _\mathrm{P}^2} \right]\)

\(p(X_\mathrm{V},X_\mathrm{A}|C=2) = \frac{1}{2\pi \sqrt{(\sigma _{SV}^2+\sigma _\mathrm{P}^2)(\sigma _{SA}^2+\sigma _\mathrm{P}^2)}} \exp \left[ -\frac{1}{2}\left( \frac{(X_\mathrm{V}-\mu _\mathrm{P})^2}{\sigma _{SV}^2+\sigma _\mathrm{P}^2} + \frac{(X_\mathrm{A}-\mu _\mathrm{P})^2}{\sigma _{SA}^2+\sigma _\mathrm{P}^2} \right) \right]\)
Substituting these solutions into Eq. 8 enables the probability of a common source (\(p(C=1|X_\mathrm{V},X_\mathrm{A})\)) to be computed.
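This computation can be sketched in Python (function names are ours; \(\sigma _{SV}\), \(\sigma _{SA}\), \(\mu _\mathrm{P}\), and \(\sigma _\mathrm{P}\) map to the arguments `sv`, `sa`, `mu_p`, `sp`):

```python
import math

def likelihoods(xv, xa, sv, sa, mu_p, sp):
    """Analytic marginal likelihoods p(X_V, X_A | C), after Kording et al. (2007)."""
    var_v, var_a, var_p = sv**2, sa**2, sp**2
    # C = 1: a single shared source S integrated out
    denom1 = var_v * var_a + var_v * var_p + var_a * var_p
    quad1 = ((xv - xa)**2 * var_p + (xv - mu_p)**2 * var_a
             + (xa - mu_p)**2 * var_v) / denom1
    p_c1 = math.exp(-0.5 * quad1) / (2 * math.pi * math.sqrt(denom1))
    # C = 2: independent sources S_V and S_A integrated out
    denom2 = (var_v + var_p) * (var_a + var_p)
    quad2 = (xv - mu_p)**2 / (var_v + var_p) + (xa - mu_p)**2 / (var_a + var_p)
    p_c2 = math.exp(-0.5 * quad2) / (2 * math.pi * math.sqrt(denom2))
    return p_c1, p_c2

def p_common_posterior(xv, xa, sv, sa, mu_p, sp, p_common):
    """Posterior probability of a common source, p(C=1 | X_V, X_A)."""
    p_c1, p_c2 = likelihoods(xv, xa, sv, sa, mu_p, sp)
    return p_c1 * p_common / (p_c1 * p_common + p_c2 * (1 - p_common))
```

As expected, the posterior is high when the two percepts nearly coincide and falls toward zero as audiovisual disparity grows.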
1.1.2 Performing the congruence judgment task
The congruence judgment task is a two-alternative forced-choice paradigm in which the subject decides whether the visual and auditory targets came from the same location (\(C = 1\)) or two different locations (\(C = 2\)). This decision can be made directly from the probability of a common source, \(p(C=1|X_\mathrm{V},X_\mathrm{A})\), in one of two ways: model selection or probability matching. These two decision models can be directly compared for a data set to determine which best explains the observed data.
1.1.3 Model selection
One approach to selecting a response is to always choose the most likely response, i.e., if \(p(C=1|X_\mathrm{V},X_\mathrm{A}) > 0.5\) choose \(C = 1\) (Körding et al. 2007).
1.1.4 Probability matching
A second possible approach is to choose each response in proportion to its probability, i.e., if \(p(C=1|X_\mathrm{V},X_\mathrm{A}) = 0.7\) choose \(C = 1\) 70 % of the time (Wozny et al. 2010).
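The two decision rules can be written compactly (a Python sketch; `p_c1` stands for the computed posterior \(p(C=1|X_\mathrm{V},X_\mathrm{A})\)):

```python
import random

def judge_model_selection(p_c1):
    """Always report the more probable cause: C = 1 iff p(C=1|X) > 0.5."""
    return 1 if p_c1 > 0.5 else 2

def judge_probability_matching(p_c1):
    """Report C = 1 at a rate equal to p(C=1|X)."""
    return 1 if random.random() < p_c1 else 2
```

Model selection is deterministic for a given pair of percepts, while probability matching adds response variability even when the posterior is fixed.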
1.2 Performing the auditory localization task
In the auditory localization task, subjects use a laser to point to the estimated location of the auditory target, \({\hat{S}}_\mathrm{A}\). Estimating the auditory target location requires integrating the percept from each sense, \(X_\mathrm{A}\) and \(X_\mathrm{V}\), with the probability they originated from a common source, \(p(C=1|X_\mathrm{V},X_\mathrm{A})\). \({\hat{S}}_\mathrm{A}\) is calculated for both \(C = 1\) and \(C = 2\) (Eqs. 15 and 16 (Körding et al. 2007)), then combined in one of three ways: averaging, model selection, and probability matching (Wozny et al. 2010). As before, these models can be directly compared for a given data set.
1.2.1 Averaging
One approach to estimating auditory target location is to perform a weighted average of \({\hat{S}}_{\mathrm{A},C = 1}\) and \({\hat{S}}_{\mathrm{A},C = 2}\) based on their probability (Körding et al. 2007):

\({\hat{S}}_\mathrm{A} = p(C=1|X_\mathrm{V},X_\mathrm{A})\,{\hat{S}}_{\mathrm{A},C=1} + \left(1 - p(C=1|X_\mathrm{V},X_\mathrm{A})\right){\hat{S}}_{\mathrm{A},C=2}\)
1.2.2 Model selection
As in the congruence judgment task, subjects may simply choose the most likely explanation for the observed signals (either \(C = 1\) or \(C = 2\)) and respond based solely on that explanation (Wozny et al. 2010).
1.2.3 Probability matching
Also as in the congruence judgment task, subjects may choose explanations at a rate commensurate with their probability (Wozny et al. 2010).
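Under standard causal-inference assumptions, the conditional estimates are reliability-weighted averages, and the three strategies combine them as follows. This Python sketch uses our own function names, and the weighted-average forms (after Körding et al. 2007) stand in for the paper's Eqs. 15 and 16:

```python
import random

def s_hat_c1(xa, xv, sa, sv, mu_p, sp):
    """Estimate if C = 1: reliability-weighted average of both percepts
    and the spatial prior (after Kording et al. 2007)."""
    w_a, w_v, w_p = 1 / sa**2, 1 / sv**2, 1 / sp**2
    return (xa * w_a + xv * w_v + mu_p * w_p) / (w_a + w_v + w_p)

def s_hat_c2(xa, sa, mu_p, sp):
    """Estimate if C = 2: the auditory percept combined with the prior only."""
    w_a, w_p = 1 / sa**2, 1 / sp**2
    return (xa * w_a + mu_p * w_p) / (w_a + w_p)

def locate_averaging(p_c1, s1, s2):
    """Weighted average of the two conditional estimates."""
    return p_c1 * s1 + (1 - p_c1) * s2

def locate_model_selection(p_c1, s1, s2):
    """Respond with the estimate from the more probable causal structure."""
    return s1 if p_c1 > 0.5 else s2

def locate_probability_matching(p_c1, s1, s2):
    """Choose between the estimates at a rate matching the posterior."""
    return s1 if random.random() < p_c1 else s2
```

Because audition is less reliable than vision, \({\hat{S}}_{\mathrm{A},C=1}\) is pulled strongly toward the visual percept, which is how visual capture arises in the model.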
1.3 Simulating task performance
The above equations can be used to estimate the probability distribution of responses for a given set of inputs, model parameters, and task strategy. Responses are simulated by repeatedly sampling perceived target locations, \(X_\mathrm{V}\) and \(X_\mathrm{A}\), from Eqs. 3 and 4 (10,000 samples). Each pair of perceived target locations is used to calculate the probability of a common cause, \(p(C=1|X_\mathrm{V},X_\mathrm{A})\), from Eq. 8. \(X_\mathrm{V}\), \(X_\mathrm{A}\), and \(p(C=1|X_\mathrm{V},X_\mathrm{A})\) for each sample are then used to compute a response from one congruence judgment or auditory localization task strategy (\(C\) or \({\hat{S}}_\mathrm{A}\), respectively). Additionally, it is possible that on some trials a subject may lose focus on the task and simply guess. To model this, responses are replaced with a guess based solely on prior expectation (for congruence judgment, \(p(C=1|X_\mathrm{V},X_\mathrm{A}) = p_\mathrm{common}\), and for auditory localization, \({\hat{S}}_\mathrm{A} \sim \mathcal{N}(\mu _\mathrm{P},\sigma _\mathrm{P})\)) at a probability equal to the inattention rate parameter (\(\lambda \)). This produces a total of 10,000 simulated responses, which can be used to estimate the probability distribution of responses. This probability distribution can be compared to subject data to estimate the likelihood of an observed subject response given a set of model parameters and task strategy. For congruence judgment, the histogram had only two bins, \(C \in \{1, 2\}\), while for auditory localization responses were binned in 1-degree intervals.
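The congruence judgment half of this procedure can be sketched as follows (Python; the function name is ours, and probability matching is used as the example strategy):

```python
import random

def simulate_congruence(p_c1_samples, p_common, lam):
    """Estimate the response distribution for the congruence judgment task.

    p_c1_samples: posterior p(C=1|X_V, X_A) for each simulated percept pair.
    On a fraction lam of trials the subject lapses and guesses C = 1 with
    probability p_common (the prior).
    """
    counts = {1: 0, 2: 0}
    for p_c1 in p_c1_samples:
        p = p_common if random.random() < lam else p_c1  # inattention lapse
        resp = 1 if random.random() < p else 2           # probability matching
        counts[resp] += 1
    n = len(p_c1_samples)
    return {c: k / n for c, k in counts.items()}
```

The returned frequencies serve as the estimated response probabilities against which observed subject responses are scored.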
1.4 Comparing common underlying processes across both tasks
The ultimate goal of this model is to compare perception across task types in order to identify any differences in visual capture between them. If we assume that the model parameters described above are sufficient to describe visual capture, then this goal can be reframed as comparing the model parameters that best explain the observed responses for each task type. Therefore, we need to find the model parameters and task strategy that best explain each data set.
Table 4 provides the limits for each parameter in this model. For each data set, we searched for the set of parameter values that maximizes the likelihood of the observed data in that set. Specifically, a MATLAB pattern search algorithm was used to find the set of parameters that minimized negative log-likelihood. Negative log-likelihood was computed by estimating the probability distribution of responses to each target pair via the simulation described above and then summing the negative log of the probability of the response given for each target pair. Essentially, this penalized a parameter set for predicting that a set of subject responses has a low probability of occurring. If the negative log-likelihood was ever infinite (i.e., the probability of a response was zero), it was replaced by a large, finite value (10,000) to prevent the search algorithm from failing. The search algorithm starts from a random point in the parameter space (independently drawn for each parameter from a uniform distribution between the limits in Table 4), computes the negative log-likelihood at a set of neighboring points (referred to as a mesh) centered on the current point, and selects the point with the smallest negative log-likelihood as the new current point. If no point with a smaller negative log-likelihood than the current point is found, the size of the mesh is reduced and the search is repeated from the same point (Hooke and Jeeves 1961). The search algorithm stops when the change in mesh size, current point, or likelihood value drops below a set tolerance (\(1 \times 10^{-4}\) for mesh size, \(1 \times 10^{-4}\) for current point, and \(1 \times 10^{-2}\) for likelihood). Unimodal localization trials were included in optimization, with response probabilities for unimodal trials computed analytically from Eqs. 3 and 4.
This approach is complicated by the fact that the objective function is stochastic, which may yield different best-fitting parameters each time the minimization algorithm is run. Pattern search is well suited to this condition because it does not compute derivatives (Hooke and Jeeves 1961) and is therefore robust to stochastic objective functions. However, it can still find different solutions or become stuck in local minima on different runs. To address this concern, the search was run 120 times, each time from a new random starting point. To ensure the parameter space was well conditioned, parameter values were normalized relative to the upper and lower bounds of each parameter within the search algorithm and then un-normalized when computing likelihood. Medians and 95 % confidence intervals were computed for each parameter from the search results.
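A simplified, self-contained version of this fitting procedure, box-bounded pattern search with parameter normalization and random restarts, might look like the following Python sketch (a Hooke–Jeeves-style illustration, not the MATLAB `patternsearch` implementation used in the paper):

```python
import random

def pattern_search(objective, lo, hi, n_starts=5, mesh0=0.25, tol=1e-4):
    """Multi-start coordinate pattern search on a box-bounded parameter space.

    Parameters are searched in normalized [0, 1] coordinates and rescaled
    before each objective evaluation, so every dimension is polled at a
    comparable scale (mirroring the normalization described above).
    """
    dim = len(lo)
    def unnorm(u):
        return [l + x * (h - l) for x, l, h in zip(u, lo, hi)]
    best = None
    for _ in range(n_starts):
        u = [random.random() for _ in range(dim)]  # random restart
        f = objective(unnorm(u))
        mesh = mesh0
        while mesh > tol:
            improved = False
            for i in range(dim):            # poll each coordinate direction
                for step in (mesh, -mesh):
                    v = list(u)
                    v[i] = min(1.0, max(0.0, v[i] + step))
                    fv = objective(unnorm(v))
                    if fv < f:
                        u, f, improved = v, fv, True
            if not improved:
                mesh /= 2                   # shrink mesh when no poll improves
        if best is None or f < best[1]:
            best = (unnorm(u), f)
    return best
```

The derivative-free polling is what makes this family of algorithms tolerant of a noisy objective; the restart loop plays the role of the 120 independent searches described above.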
Cite this article
Bosen, A.K., Fleming, J.T., Brown, S.E. et al. Comparison of congruence judgment and auditory localization tasks for assessing the spatial limits of visual capture. Biol Cybern 110, 455–471 (2016). https://doi.org/10.1007/s00422-016-0706-6