Introduction

It is widely agreed that the cross-modal timing between two stimuli plays a key role in multisensory processing 1,2 (see Koelewijn 2 for a review). An audiovisual disparity, or stimulus onset asynchrony (SOA), of ~ 100 ms can substantially impede the perception of simultaneity 3,4,5 and provide sufficient information for temporal order judgement 6,7. The improvement in perceptual performance (e.g., reaction time or accuracy) gained by adding a stimulus from a second modality also diminishes with increasing audiovisual SOA 8,9,10,11,12. Such time sensitivity indicates that complex neural circuits, which are not yet fully understood, are involved in audiovisual interactions, and potentially in cross-modal plasticity after hearing loss.

The range of SOAs up to 100 ms, which cross-modal temporal processing (simultaneity and temporal order judgement) is sensitive to 13,14,15, has been studied in human ERP and MEG experiments 16,17,18,19,20,21,22. We refer to SOAs of this range as “short SOAs” and both types of studies have shown that short SOAs can modulate the multisensory component of ERP activities. However, longer SOAs were not extensively studied in these human ERP experiments.

Using extracellular recording or behavioral measurements, a few investigations have shed some light on the effect of long SOAs in multisensory processing. In macaque primary auditory cortex, Lakatos et al. 23 showed that neuronal activities evoked by a click were modulated by a preceding tactile stimulus with up to about 800-ms SOA. Fiebelkorn et al. 24 measured fluctuating behavioral performance in detecting a near-threshold Gabor stimulus after a preceding tone beep at SOAs of up to 6 s. The findings of both studies imply that the effect of long SOAs on multisensory interaction is due to oscillations in cortical excitability that are phase-locked to the preceding stimulus. This would contradict the evoked model 25, in which neural activity evoked by the preceding stimulus may have a more limited effective period. Thus, we hypothesized that, in auditory ERPs, cross-modal modulation originating from a visual input should also occur with audiovisual temporal disparity beyond the range sensitive for multisensory temporal processing, where a periodic pattern of fluctuation may be observed.

The existing ERP studies on the temporal disparity of audiovisual integration provided very limited information specific to long SOAs and their spectral patterns 26,27,28,29,30,31,32. To fill this research gap, the current study aims to provide evidence of the interaction between cat ERPs in response to auditory (click) and visual (flash) stimuli and audiovisual SOAs up to 1 s (Fig. 1). We found that the amplitude of N1 from cortical auditory evoked potentials (cAEPs) in cats under dexmedetomidine sedation was affected by audiovisual SOAs. The change in N1 amplitude as a function of SOA revealed temporal dynamics of visual modulation with an oscillatory pattern.

Figure 1

Stimulus paradigm and click grouping based on flash-to-click delays. (a) Three stimulus conditions, audiovisual (AV), auditory only (A) and visual only (V), were each presented 10 times to each subject while the EEG signal was continuously recorded. The same click train and flash train were used repeatedly in all three conditions. (b) The EEG signal from the V condition was subtracted from that of the AV condition in each repeat to generate an AV-V condition. For both the AV-V and the A conditions, epochs time-locked to click onsets were extracted and averaged to derive cortical auditory evoked potentials (cAEPs). (c) To investigate the effect of audiovisual temporal disparity, clicks were sorted by flash-to-click delay and grouped into different bins. In this way, cAEP waveforms can be obtained separately from different click groups. Flash-to-click delays overall spanned from 0 to about 1000 ms.

Results

Cats under dexmedetomidine sedation were presented with 1-min trains of clicks (auditory, A), flashes (visual, V), and unsynchronized clicks and flashes (audiovisual, AV) (Fig. 1a). Then, an offline bandpass filter (1–10 Hz) was applied to obtain cortical auditory evoked potentials (cAEPs). First, we extracted epochs time-locked to click onsets from all three stimulus conditions. The grand-averaged waveforms derived from both the AV and the A conditions, but not from the V condition, revealed clear cAEPs (Fig. 2a). The flash stimuli did not seem to influence the grand-averaged waveforms of click cAEPs, because flash and click stimuli were out of sync. This, however, may not be the case when specific flash-to-click delays are investigated. Therefore, EEG signals from the V condition were subtracted from the corresponding AV condition in each of the 10 repeats, generating an AV-V condition (Fig. 1b). For further data analysis, epochs were extracted from the derived AV-V and the original A conditions, respectively, for waveform averaging and peak measurements (Fig. 2b).

Figure 2

Cortical auditory evoked potentials (cAEPs) from all stimulus conditions. (a) Grand-averaged waveforms of cAEPs in the three stimulus conditions. The epochs were time-locked to click onsets before averaging. Note that in the visual-only (V) condition, the click onsets were the same as in the auditory-only (A) and the audiovisual (AV) conditions, even though no click was presented. (b) Contrast of cAEP waveforms between the A and the AV-V conditions. Inset, an enlarged view of the waveform near the click onset and the baseline between the two vertical lines (from 5 ms before to 5 ms after click onsets).

Data were collected from 14 cat subjects. Regardless of the flash-to-click delay, each subject was presented with 370 clicks in each of 10 repeats, giving rise to an average of 3700 epochs in each individual cAEP waveform (Supplementary Fig. 1a). The cAEP waveforms from both conditions featured a prominent positive peak component at about 35-ms latency, which we referred to as P1, followed by a slower and wider negative peak component at about 95-ms latency post-click, which we referred to as N1 (Fig. 2b). A second positive peak component, less prominent than P1, was present at about 170-ms latency, which we referred to as P2. From the grand-average waveforms, we observed a near-perfect overlap between the AV-V and the A conditions, especially for the initial 125 ms after click onset, suggesting a well-preserved cAEP morphology when unsynchronized visual stimuli were simultaneously present. There appeared to be an elevation of the traces starting at 150 ms after click onset in the AV-V condition.

The P1-N1-P2 complex was observed in all subjects. The amplitudes and latencies of each of the three peak components were measured (Table 1 and Supplementary Fig. 1a). Only P2 amplitude was significantly larger in the AV-V condition than in the A condition (ΔampP2 = 0.14, p = 0.007 < 0.01). Later analysis showed that the recordings from three subjects were noisier. This became more apparent in peak identification, when cAEPs were analyzed in separate click groups according to the flash-to-click delay (Supplementary Fig. 1b). Excluding these three subjects, however, did not change the result of the comparisons between the AV-V and the A conditions above (ΔampP2 = 0.18, p = 0.005 < 0.01). Although P2 amplitude demonstrated an effect of visual modulation that did not depend on the timing between flash and click stimuli, the P1 and N1 components, as well as P2 latency, did not, which is consistent with the existing knowledge that an out-of-timing visual stimulus does not affect auditory processing 33. To investigate how stimulus timing plays a role in the effect of visual modulation, we focused on N1 amplitude as the major measurement in the following data analysis.

Table 1 Amplitudes and peak times of P1-N1-P2 complex in individual subjects.

The effect of flash-to-click delay on visual modulation of cAEPs

To examine the relationship between audiovisual temporal disparity and visual modulation of auditory processing, we sorted all the click stimuli by their flash-to-click delays (Fig. 1c). At first, we created 8 groups with an equal number of clicks in each group. In this case, the first click group was composed of clicks with a flash-to-click delay between 0 and 79 ms, while the last group was composed of clicks with a flash-to-click delay between 894 and 1731 ms. Detailed descriptive statistics on the flash-to-click delays are listed below (Table 2). Next, the cAEP waveforms were derived from each of the 8 click groups (Supplementary Fig. 1b). The contrast between the A and the AV-V conditions for each click group can therefore represent the cortical processing of click stimuli under the influence of visual modulation within a specific window of audiovisual temporal disparity.

Table 2 Descriptive statistics about the flash-to-click delays in each of the eight click groups.

We first compared the range of N1 amplitudes across the 8 click groups. It appeared that there was a larger range of N1 amplitude across the 8 click groups in the AV-V condition than the A condition (Supplementary Fig. 1c), although this difference was not statistically significant.

Next, one-way repeated-measures ANOVA was performed to test the statistical effect of click group on the change of N1 amplitude (ΔampN1) against the variance across subjects. We found a significant main effect of click group (F10, 70 = 2.72, p = 0.015 < 0.05). Given the small sample size, we also carried out a permutation test, where the correspondence between the click groups and the ΔampN1 values was randomly scrambled for each subject independently. This allowed us to determine a false discovery rate of 1.0% when accepting 0.015 as the alpha level.
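
The within-subject scrambling described above can be illustrated with a short sketch. It is written in Python rather than the Matlab used in the study, and the function names (`rm_anova_f`, `permutation_p`), the input layout (`data[s][g]` holding ΔampN1 for subject `s` and click group `g`), and the decision to report the fraction of permutations whose F statistic reaches the observed one are all assumptions made for illustration, not the authors' implementation.

```python
import random


def rm_anova_f(data):
    """F statistic of a one-way repeated-measures ANOVA.

    data[s][g] is the value for subject s in condition (click group) g.
    """
    n = len(data)      # number of subjects
    k = len(data[0])   # number of conditions
    grand = sum(sum(row) for row in data) / (n * k)
    cond_means = [sum(row[g] for row in data) / n for g in range(k)]
    subj_means = [sum(row) / k for row in data]
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    # Residual after removing condition and subject effects.
    ss_err = ss_total - ss_cond - ss_subj
    if ss_err <= 0:
        return float("inf")
    return (ss_cond / (k - 1)) / (ss_err / ((k - 1) * (n - 1)))


def permutation_p(data, n_perm=1000, seed=0):
    """Fraction of permutations whose F is at least the observed F.

    Each subject's values are shuffled across conditions independently,
    which breaks any consistent link between condition and value.
    """
    rng = random.Random(seed)
    f_obs = rm_anova_f(data)
    count = 0
    for _ in range(n_perm):
        shuffled = [rng.sample(row, len(row)) for row in data]
        if rm_anova_f(shuffled) >= f_obs:
            count += 1
    return count / n_perm
```

Shuffling within each subject keeps every subject's own set of values intact, so the permutation distribution reflects only the loss of a consistent group-to-value mapping.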

To further identify the specific click groups that demonstrated delay-dependent visual modulation, we performed Wilcoxon signed-rank tests in each of the 8 click groups, comparing ΔampN1 with either 0 (i.e., assuming no visual modulation at all as the null hypothesis) or the ΔampN1 derived from each subject without click grouping (i.e., assuming no delay dependency as the null hypothesis). In both approaches, a significant suppression of N1 amplitudes, as indicated by a positive ΔampN1, was found for the 34-ms click group and the 198-ms click group (Fig. 3). Again, we used the same permutation procedure described above to confirm that accepting both positive findings (34-ms: p = 0.008 < 0.01; 198-ms: p = 0.013 < 0.05) yielded an accumulated false discovery rate of 0.3% when ΔampN1 values were compared to zero. The other click groups failed to reveal a statistically significant visual modulation, suggesting that visual modulation in those ranges of audiovisual temporal disparity was less consistent across subjects. We also explored other numbers of bins (from 2 to 12) for click grouping and found that the pattern by which visual modulation of N1 amplitude depends on audiovisual temporal disparity was consistently observed using 7-, 8-, 9-, 10-, and 11-bin grouping of clicks (Supplementary Figs. 2 and 3).

Figure 3

Effect of audiovisual temporal disparity on visual modulation of N1 amplitude. Median of change in N1 amplitude for each of the 8 click groups. The medians of the flash-to-click delays were used as horizontal coordinates. Error bar, half of the inter-quartile range across subjects. Red dashed line, the null hypothesis with no visual modulation. Blue error bar, the inter-quartile range of ΔampN1 across subjects without click grouping.

Visual modulation of N1 amplitude predicted by audiovisual temporal disparity

Finally, we adapted a kernel regression procedure to weight each of the 370 click epochs and predict the cAEP waveform specific to a given audiovisual temporal disparity (audiovisual SOA), which we also termed a Gaussian-weight averaging approach (Supplementary Fig. 4). For any given SOA, epochs were averaged with weight values derived from a Gaussian kernel centered at this SOA. The bandwidth of the Gaussian kernels was controlled by the parameter σ, which was set to 100, 50, 20, 10, or 5 ms (Fig. 4a–e) to explore the trade-off between bias and variance of the prediction. As before, N1 amplitudes were measured and contrasted between the A and the AV-V conditions. The time course of visual modulation in N1 amplitude can be characterized by directly plotting ΔampN1 as a function of audiovisual SOA (Fig. 4a–e, Left).

Figure 4

Visual modulation of N1 amplitude depends on audiovisual temporal disparity. (a–e) For kernels with different bandwidths (σ), change in N1 amplitude as predicted by audiovisual SOA, derived from Gaussian-weight averaging of cAEPs. Left, the original ΔampN1. Right, proportion of permutation-derived ΔampN1 smaller than the original ΔampN1. Dotted line, peak detection with large variance indicated by latency beyond 150 ms or less than 55 ms.

The scarcity of clicks with long flash-to-click delays added variance to the prediction near the end of the evaluated SOA range. To alleviate its interference, we ran 1000 permutations in which all the flash-to-click delays were randomly reassigned to the 370 click epochs, and obtained the proportion of permutation-derived ΔampN1 values that were smaller than the original ΔampN1 (Fig. 4a–e, Right). Additionally, to monitor the quality of peak detection, N1 latency was measured at the same time.

Using the kernels with a large bandwidth (σ > 20 ms), we observed an overall transition from visual suppression to facilitation of N1 amplitude at ~ 300-ms SOA (Fig. 4a–c). Using the kernels with a smaller bandwidth, an early and transient facilitation can be identified at ~ 100-ms SOA (Fig. 4c–e). Such temporal dynamics were also partially captured by the earlier analysis in which the clicks were grouped in discrete bins. Furthermore, strong visual modulation of N1 amplitude was also revealed at multiple SOAs, such as 300 and 400 ms, when the kernels with a small bandwidth were used (Fig. 4d), suggesting multiple temporal integration windows for audiovisual interaction.

Discussion

In this study, we examined and demonstrated the effect of audiovisual temporal disparity or stimulus onset asynchrony (SOA) on visual modulation of cortical auditory evoked potentials (cAEPs). The audiovisual interaction was investigated using similar approaches in two previous human ERP studies, with SOAs below 100 ms 17 and 70 ms 16, respectively. A few studies using extracellular recordings examined SOAs up to 500 ms in the superior colliculus 1 and 320 ms in auditory cortices 34. These studies have made the discoveries of the neural correlates to the “temporal window of integration” that were measured behaviorally, demonstrating strong evidence for a “coincidence detector” as a neurophysiological mechanism 35,36,37.

Long SOAs, although unlikely to be involved in temporal integration or temporal processing (perception of multisensory simultaneity and temporal order), may still permit effective cross-modal modulation of sensory processing. This idea has been supported by both behavioral data 24,38,39 and some neurophysiological evidence 23,40. Lakatos et al. 23 pointed out that the optimal SOAs for tactile modulation of sound-evoked neuronal activities in their data were associated with the periodic intervals of several EEG oscillations. According to the “phase reset” hypothesis they proposed, a preceding tactile stimulus resets the phase of ongoing neural oscillations in the primary auditory cortex, which in turn determines the state of fluctuating auditory excitability. When the SOA between the preceding tactile stimulus and the following auditory stimulus is aligned to the high-excitability, up-phase of neural oscillation, the auditory stimulus evokes a larger response than when the tactile-auditory SOA is aligned to the low-excitability, low-phase of neural oscillation. The observation of excitability fluctuation has been further evidenced with various behavioral and electrophysiological measurements, including extracellular recording 34, human ERP 16, phosphenes induced by transcranial magnetic stimulation 40,41, and reaction time 24,38,39. Although our analysis was mainly focused on the prediction of visual modulation by audiovisual temporal disparity, the result did exhibit a pattern of fluctuating suppression/facilitation as SOA increased from 0 to 1000 ms. It is worth noting that neither the auditory nor the visual stimuli in this study were designed as periodic inputs. Therefore, the oscillation in visual modulation we observed may reflect an intrinsic property of neural networks.

One of the many missions of future multisensory research is to bring together the knowledge established from extracellular recordings in animal models and from whole-brain imaging in humans. While intracranial recordings in humans are still rare and challenging to obtain, scalp-EEG recordings from large animal models, such as marmoset 42,43 and cat 44,45,46,47,48, are quickly developing into a uniquely useful neurophysiological approach.

Electrical and magnetic mappings of whole-brain activities during audiovisual perception have provided valuable insights into its neural mechanism, involving intra-cortical functional connectivity 49 and topographic re-distribution 26. Human auditory evoked potentials have been well characterized for a variety of components as neural correlates of sound processing at different stages of the ascending auditory pathway 50,51. The current study is the first scalp-recorded EEG multisensory study in animal models and, unusually for the literature, is focused on auditory evoked potentials under visual modulation. We compared ERPs from the auditory-only condition with a derived condition obtained by subtracting the signal of the visual-only condition from the audiovisual condition, rather than comparing the audiovisual condition with a derived condition obtained as the “sum of the auditory and the visual conditions”. This allowed us to select peak components time-locked to auditory stimuli, which are expected to be more interpretable in terms of auditory processing.

To summarize, in this study we mainly characterized N1 amplitude in scalp-recorded auditory evoked potentials (AEPs) from cats under dexmedetomidine sedation as a measurement for visual modulation of auditory processing. We found that the delay function, sampled with both sparse grouping approach and fine-resolution weight-average approach, revealed a short-SOA effect peaking at ~ 100 ms, which was followed by a long-SOA effect characterizing the time course of visual modulation over ~ 1-s period. With the advantages of our animal models and experiment paradigms, future studies are expected to characterize the spectrotemporal features in normal and sensory-deprived subjects and to identify the neural mechanism underlying cross-modal interactions.

Methods

All procedures were conducted in compliance with the National Research Council's Guide for the Care and Use of Laboratory Animals (8th edition; 2011), the Canadian Council on Animal Care's Guide to the Care and Use of Experimental Animals (1993), and the ARRIVE guidelines. Furthermore, the following procedures were also approved by Animal Care Committee (DOWB) for the Faculty of Medicine and Health Sciences at McGill University.

Animal preparation and anesthesia protocol

Cats (Felis catus) were obtained from a commercial breeder of animals for biomedical research (Marshall Bioresources). We recorded from 14 cats with an average age of 4.7 ± 1.5 years, two of which were male. After subjects were sedated using dexmedetomidine (0.04 mg/kg, Dexdomitor, Zoetis) injected intramuscularly, the left eye was occluded using a black contact lens so that visual stimuli were presented unilaterally. Phenylephrine (Mydfrin, Alcon) was applied to the right eye to dilate the pupil, and saline drops were used as lubrication. Subjects were placed on a water-circulated heating pad (TP-400, Gaymar). Once vital signs (heart rate and SpO2) were stable, two 15-min recording sessions were carried out while the subject was breathing pure oxygen (Dispomed). At the end of the two recording sessions, data collection terminated in nine subjects and continued in the other five under isoflurane anesthesia for a separate study. The subject’s vital signs and electrode impedance were checked between the two sessions. At the end of data collection, electrodes and contact lens were removed before atipamezole (Antisedan, Zoetis) was administered intramuscularly to facilitate recovery from the dexmedetomidine sedation.

Visual and auditory stimuli

The visual stimuli consisted of flashes that were presented to subjects from a 5-mm-diameter light-emitting diode (~ 11 degrees of visual field, LED, DigiKey). The intensity of flash stimuli was calibrated to 10 cd/m² by adjusting the voltage magnitude of a 300-μs-long square pulse used as the input signal to the LED. The auditory stimuli were 300-μs-long clicks emitted by an 8-cm-diameter loudspeaker (Fostex). The sound level of the click stimuli was calibrated to 55 dB SPL using a sound meter (Model 2250, B&K). Both auditory and visual stimulation signals were generated by the same digital-to-analogue processor (RZ2, TDT). The LED was attached to the top of the loudspeaker and placed 8 cm away from the subject at 45 degrees to the right of the midline.

To manipulate the timing of the auditory and visual stimuli, two independent, 57-s-long pulse trains for triggering clicks and flashes, respectively, were generated in advance in Matlab using a Poisson random process and loaded into the stimulus/recording software (Synapse, TDT). The auditory stimulus train contained 370 click pulses and the visual stimulus train contained 70 flash pulses. The minimal inter-click interval was set to 20 ms and the minimal inter-flash interval was set to 500 ms. The auditory and the visual stimulus trains always started and stopped simultaneously in each session. Auditory-only (A), visual-only (V), and audiovisual (AV) stimulus trains were played in alternation and repeated 10 times.
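
How the minimal interval was enforced is not specified here, so the following Python sketch shows only one possible construction and should not be read as the procedure actually used. It relies on the fact that, for a fixed pulse count, the arrival times of a homogeneous Poisson process are distributed like sorted uniform samples. The samples are drawn on a shortened time axis and each pulse is then shifted by a multiple of the minimal interval. The function name `random_pulse_train` and its parameters are hypothetical.

```python
import random


def random_pulse_train(n_pulses, duration_ms, min_interval_ms, seed=None):
    """Random pulse onsets (ms) with a guaranteed minimal inter-pulse interval.

    Sorted uniform samples are drawn on a shortened axis, then the i-th
    pulse is shifted by i * min_interval_ms so that neighbours can never
    be closer than the minimal interval.
    """
    rng = random.Random(seed)
    slack = duration_ms - (n_pulses - 1) * min_interval_ms
    if slack < 0:
        raise ValueError("train is too short for this many pulses")
    base = sorted(rng.uniform(0.0, slack) for _ in range(n_pulses))
    return [t + i * min_interval_ms for i, t in enumerate(base)]


# Values taken from the text: 57-s trains, 370 clicks (>= 20 ms apart)
# and 70 flashes (>= 500 ms apart), generated independently.
clicks = random_pulse_train(370, 57_000, 20, seed=1)
flashes = random_pulse_train(70, 57_000, 500, seed=2)
```

Because the two trains are drawn independently, the flash-to-click delays are not controlled directly and emerge from the relative timing of the two trains.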

Since click train and flash train were “out of sync”, a flash-to-click delay can be determined for each of the 370 clicks as the retrospective interval between the click onset and the onset of the immediately preceding flash (Supplementary Fig. 5). The flash-to-click delay spanned from 0 to beyond 1000 ms, although it does not conform to a uniform distribution.
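
As a minimal sketch of this definition (in Python, with a hypothetical function name), the delay of each click can be found by a binary search for the latest flash onset that does not come after the click. Treating a flash that coincides with a click as the preceding flash (delay 0) and returning `None` for clicks that occur before the first flash are assumptions of this sketch.

```python
from bisect import bisect_right


def flash_to_click_delays(click_onsets, flash_onsets):
    """Delay from the immediately preceding flash to each click.

    Onsets share one time unit (e.g., ms). A click with no preceding
    flash gets None.
    """
    flashes = sorted(flash_onsets)
    delays = []
    for click in click_onsets:
        # Index of the last flash whose onset is <= the click onset.
        idx = bisect_right(flashes, click) - 1
        delays.append(click - flashes[idx] if idx >= 0 else None)
    return delays
```

For example, with flashes at 0, 500 and 1200 ms, a click at 1150 ms is assigned a delay of 650 ms, because the flash at 500 ms is the most recent one.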

EEG recording and signal processing

Three 25G stainless steel needles were placed subdermally as recording electrodes (Supplementary Fig. 6). The active electrode was placed near the midpoint of the subject’s interaural line, while the reference electrode was placed below the right ear (ipsilateral to the side of visual stimulation). The ground electrode was placed on the subject’s dorsum (~ 10 cm behind the shoulder blade near the midline). The impedance of both active and reference electrodes was maintained below 3 kΩ during recording. The signal was amplified and digitized with a pre-amplifier (Medusa4Z, TDT), streamed onto a digital signal processor (RZ2, TDT), and stored on a computer hard drive. The analogue signal was digitized at a sample rate of ~ 6.1 kHz and passed through an anti-aliasing filter between 0.1 Hz and 1830 Hz.

All data analysis was performed offline. The signal was digitally notch-filtered at 60 Hz before passing through a band-pass filter (1–10 Hz) for cortical auditory evoked potentials. Then, the filtered signals from the same stimulus conditions were averaged. For the AV-V condition, AEPs were derived by subtracting the visual-only (V) session average from the audiovisual (AV) session average. For the A condition, AEPs were derived from the auditory-only (A) session average.

Flash-to-click lags were calculated for each individual click as the delay of its onset to the onset of its preceding flash for audiovisual stimulus. Epochs were extracted between 200-ms pre-click and 400-ms post-click.

In time-binned sub-group averaging, epochs were sorted in ascending order of flash-to-click lag. Taking 8-bin grouping as an example, bins were created for every 46 epochs and labeled with the median flash-to-click lag. The first 368 epochs were included, with the remaining 2 epochs discarded. In Gaussian-weight averaging, SOAs were selected from 0 to 1000 ms with a 5-ms step. For each SOA, a Gaussian kernel function with one of the five bandwidths (σ = 5, 10, 20, 50, 100 ms) was centered at the SOA. Clicks within a ± 3 σ range were included in the average with weight values given by the Gaussian kernel function. Click epochs with flash-to-click lags farther from the SOA (i.e., the peak of the Gaussian kernel) therefore contributed less to the averaged waveform.
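
Both averaging schemes can be sketched in a few lines of Python. The function names and data layout (one list of samples per epoch, one lag per epoch) are hypothetical, and dividing by the sum of the weights, as in a standard kernel-regression estimate, is an assumption, since the normalization is not stated in the text.

```python
import math
from statistics import median


def equal_count_bins(lags, n_bins):
    """Group epochs into bins of equal count after sorting by lag.

    Returns one (median_lag, epoch_indices) pair per bin. Epochs left
    over after integer division (the longest lags) are discarded.
    """
    order = sorted(range(len(lags)), key=lambda i: lags[i])
    size = len(lags) // n_bins
    bins = []
    for b in range(n_bins):
        idx = order[b * size:(b + 1) * size]
        bins.append((median(lags[i] for i in idx), idx))
    return bins


def gaussian_weighted_average(epochs, lags, soa, sigma):
    """Average epochs with Gaussian weights centered at a given SOA.

    Epochs whose lag is more than 3 * sigma from the SOA get zero
    weight. Returns None when no epoch falls inside that range.
    """
    weights = []
    for lag in lags:
        d = lag - soa
        inside = abs(d) <= 3 * sigma
        weights.append(math.exp(-0.5 * (d / sigma) ** 2) if inside else 0.0)
    total = sum(weights)
    if total == 0.0:
        return None
    n_samples = len(epochs[0])
    return [
        sum(w * ep[t] for w, ep in zip(weights, epochs)) / total
        for t in range(n_samples)
    ]
```

With 370 epochs and 8 bins, `equal_count_bins` yields 46 epochs per bin and drops the last 2, matching the numbers above. The SOA grid would be `range(0, 1001, 5)`, with `gaussian_weighted_average` called once per SOA and per σ.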

Extraction of N1 amplitude

First, a peak latency of N1 was determined from all click responses averaged together, which was then used as a reference. To find the peak of N1 in cAEP waveforms derived from sub-groups of clicks, we wrote a custom Matlab program that identified all local minima on each waveform and selected the minimum with the latency closest to the previously determined reference latency. The amplitude of N1 was measured in reference to the baseline (from 5 ms before to 5 ms after the click onsets).
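
The peak-picking rule can be sketched as follows. This is an illustrative Python version, not the custom Matlab program mentioned above. The function name is hypothetical, and both the strict-inequality definition of a local minimum and the use of the mean over the baseline window are assumptions.

```python
def find_n1(waveform, times_ms, ref_latency_ms, baseline_ms=(-5.0, 5.0)):
    """Locate N1 as the local minimum closest to a reference latency.

    Returns (latency_ms, amplitude), with amplitude measured relative
    to the mean of the waveform inside the baseline window, or None
    if the waveform has no local minimum.
    """
    base_vals = [
        v for v, t in zip(waveform, times_ms)
        if baseline_ms[0] <= t <= baseline_ms[1]
    ]
    if not base_vals:
        raise ValueError("no samples inside the baseline window")
    baseline = sum(base_vals) / len(base_vals)

    # Interior samples strictly lower than both neighbours.
    minima = [
        i for i in range(1, len(waveform) - 1)
        if waveform[i - 1] > waveform[i] < waveform[i + 1]
    ]
    if not minima:
        return None
    best = min(minima, key=lambda i: abs(times_ms[i] - ref_latency_ms))
    return times_ms[best], waveform[best] - baseline
```

Anchoring the search to a reference latency keeps the detector from jumping to an unrelated trough when a sub-group average is noisy, although a large deviation of the returned latency from the reference remains a useful warning sign.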

Statistics

Repeated-measures ANOVA and Wilcoxon signed-rank tests were performed in Matlab using the Statistics and Machine Learning Toolbox™. Permutation tests were performed using custom Matlab programs. For Gaussian-weight averaging, N1 amplitudes were derived using the grand average across subjects. To test for statistical significance, 1000 permutations were performed by randomizing the mapping between the epochs and their flash-to-click delays.
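
The permutation scheme for Gaussian-weight averaging can be outlined in Python as below. The function name is hypothetical, and `statistic` stands in for whatever quantity is recomputed on each permutation (for instance, ΔampN1 at one SOA); that callable is an assumption of this sketch and is not defined in the text.

```python
import random


def permutation_proportion(delays, statistic, n_perm=1000, seed=0):
    """Proportion of permutations with a statistic below the original.

    delays holds one flash-to-click delay per epoch, in epoch order.
    statistic(delays) returns a number. Shuffling the delays breaks
    the mapping between epochs and their delays.
    """
    rng = random.Random(seed)
    observed = statistic(delays)
    shuffled = list(delays)
    smaller = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if statistic(shuffled) < observed:
            smaller += 1
    return smaller / n_perm
```

A proportion near 1 indicates that the original value is larger than almost all permuted values, and a proportion near 0 indicates the opposite, so both tails can be read from the same number.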