In everyday situations, we perceive common objects via multiple senses. For humans, vision and audition are arguably most important in this regard. Despite recent efforts to elucidate where and how auditory and visual object features are integrated in the brain into coherent audio-visual (AV) representations, several important questions have remained unanswered. In particular, there is an ongoing debate about whether such integration predominantly occurs at higher levels of cortical processing (i.e., in so-called heteromodal regions; Calvert 2001; Beauchamp 2005a; Amedi et al. 2005; Hein et al. 2007; Doehrmann and Naumer 2008; Naumer et al. 2009; Werner and Noppeney 2010a) or at rather low-level cortical processing stages, i.e., in regions traditionally assumed to serve strictly unisensory functions (Schroeder and Foxe 2005; Ghazanfar and Schroeder 2006; Macaluso 2006; Kayser and Logothetis 2007; Meienbrock et al. 2007; Driver and Noesselt 2008; Doehrmann et al. 2010). Irrespective of whether multisensory integration mainly takes place in parallel or subsequent to unisensory processing, most researchers agree that multisensory object perception generally involves networks of widely distributed brain regions (Naumer and Kaiser 2010). Within those distributed neural representations, the issue of functional connectivity (i.e., networks of co-activated regions) has been rather neglected. Closing this gap appears to be especially relevant with regard to audio-visual (AV) processing of common objects, as it involves the integration of both multiple higher-level stimulus features and semantic memory processes (Doehrmann and Naumer 2008).

Functional connectivity is typically formalized as the timepoint-by-timepoint covariation between activation time courses of pairs of spatially separated brain regions (Friston et al. 1993). Investigating patterns of covariations between brain regions may provide information on how these regions specifically interact in different contexts, such as different stimuli, task instructions, cognitive sets, or mental states (Friston et al. 1993; Rogers et al. 2007). Recently, it has been demonstrated that the application of spatial independent component analysis (sICA, McKeown et al. 1998) to human functional magnetic resonance imaging (fMRI) data can provide a robust non-invasive measure of functional connectivity (van de Ven et al. 2004; Bartels and Zeki 2005; Rajapakse et al. 2006; Rogers et al. 2007). In spatial ICA, “spatial independence” refers to the assumption of statistical independence between spatially distributed processes, which combine linearly to constitute the measured functional time series. In fMRI, spatial ICA aims to estimate a weighting matrix that projects the data into a space in which the spatial modes are as independent as possible, while leaving the time courses of the spatial modes unconstrained. This is done by optimizing an objective criterion, such as minimizing mutual information (Bell and Sejnowski 1995) or maximizing negentropy (Hyvärinen 1999). Spatial maps are then interpreted as maps of functional connectivity, with maximized independence between maps being similar to high dependence within maps. For example, we have used a group-level ICA approach to reveal networks of functionally connected cortical regions involved in overt speech production and speech monitoring (van de Ven et al. 2009).
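The spatial ICA model described above can be restated compactly as follows. The symbols $X$, $A$, $S$, $W$, $T$, $V$, and $K$ are introduced here purely for illustration and are not notation from the original text; the sketch also ignores the dimension-reduction step that usually precedes ICA in practice.

```latex
\begin{align}
X &= A\,S, & X &\in \mathbb{R}^{T \times V},\quad A \in \mathbb{R}^{T \times K},\quad S \in \mathbb{R}^{K \times V},\\
\hat{S} &= W\,X, & W &\in \mathbb{R}^{K \times T}.
\end{align}
```

Here $T$ is the number of measured volumes, $V$ the number of voxels, and $K$ the number of components. Each row of $S$ is one spatial map, each column of $A$ is the time course associated with that map, and $W$ is the weighting (unmixing) matrix, chosen such that the rows of $\hat{S}$ are as statistically independent as possible. No independence constraint is placed on the columns of $A$.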

In the present study, we used sICA to map the cortical AV object perception network by means of functional connectivity and used the results to predict their associations to uni- and multisensory processing in an independent second dataset. FMRI data of a passive AV experiment (experiment 1) were decomposed individually for each subject (fully data-driven) into spatial independent components (ICs) and clustered in the subject space using an extension of self-organized grouping ICA (sogICA) (Esposito et al. 2005; van de Ven et al. 2009) to obtain a representation of the spatial modes and associated time courses on the group level (schematically illustrated in Fig. 1b). The statistics of spatial modes and time courses can then be further investigated using random-effects-like statistics, such as t-tests of component values across participants. The group-level connectivity modes were then classified according to spatial (presence of key uni- and multisensory brain regions in the spatial modes) and temporal information (using the knowledge about the sequence of experimental conditions during the first experiment) as auditory, visual, or multisensory networks. Due to the weighted mixing of all independent components into the measured fMRI data (McKeown et al. 1998; Calhoun et al. 2001; van de Ven et al. 2004; Fig. 1c), we hypothesized that the voxel time courses in potential AV integration regions should mainly reflect substantial contributions of at least two of these three spatial connectivity maps. More specifically, we investigated possible relations between components by examining the parts of their spatial distributions that overlap. We show that these overlaps can be explained in a meaningful way: Unisensory regions, found in unisensory ICA maps, can show multisensory effects, with overlap between unisensory components indicating low-level interactions, and overlap between uni- and multisensory regions indicating an interaction of lower- and higher-level processes.
Thus, sogICA of the first experiment allowed us to reveal a bilateral network of multisensory candidate regions including superior temporal (pSTS), ventral occipito-temporal (VOT), ventro-medial occipital (VMO), posterior parietal (PPC), and prefrontal cortices (PFC). In order to explicitly test these regions for their integrative capacities, we conducted a region-of-interest (ROI)-based analysis of an independent second AV experiment using a conventional general linear model (GLM)-based approach. We hypothesized that activation in all ROIs should fulfill the max-criterion for AV convergence (AV > max[A, V]). Based on the recent literature on effects of semantic congruency versus incongruency during AV object processing (Lewis 2010; Meyer et al. in press; Noppeney et al. 2010; van Atteveldt et al. 2010), we expected higher BOLD signal increases for semantically congruent AV stimuli in pSTS and VOT ROIs and for incongruent AV pairings in VMO, PPC, and PFC ROIs, respectively.

Fig. 1

Relationship between spatial independent component analysis (sICA) and voxel-based GLM analysis. FMRI raw data (a) were decomposed into spatially independent components (b) that, mixed together (c), reproduced the data for each voxel. Alternatively, the data can be analyzed using knowledge of the stimulation time course (d) via a hypothesis-driven voxel-based GLM, resulting in statistical information based on the individual voxel time courses (e). SICA (a) results in spatially independent maps (b, left column) that cover the whole geometrical extent of the raw data and contain weights that vary strongly within each map, such that clusters of voxel weights may appear after thresholding. Each map is associated with a single time course (b, right column). When these component time courses are tested using knowledge of the stimulation time course (d), the components may be classified as mainly auditory, visual, or AV (among others, such as physiological components related to breathing or heartbeat). The voxel time courses (e) can be thought of as the sum of all component time courses weighted by the values of the respective component maps at that voxel (c). This can result in a variety of voxel characteristics: voxels in a region where only one spatial component has large map values will show a time course very similar to the respective component time course, e.g., mainly auditory (voxel 1 in c and e, top row) or mainly visual activation (voxel 2 in c and e, second row). Due to the weighted mixing of components (c), both visual and auditory unisensory components can contribute equally to a voxel time course (voxel 3 in c and e, third row). In a GLM analysis, the effects of auditory and visual stimulation may be simply additive at this voxel. If the mixing comprises non-zero coefficients for components that describe purely multisensory processing, i.e., processing that is absent during purely unimodal stimulation, the respective voxel might show superadditive effects (voxel 4 in c and e, bottom row). Inferences from sICA results refer to systems-level (i.e., multivariate) behavior, whereas inferences from GLM results refer to voxel (or voxel-cluster) behavior. M, multisensory
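The weighted mixing described in this caption can be sketched in a few lines of Python. This is a toy illustration only: the component time courses, the map weights, and the helper name `mix_voxel` are hypothetical and are not taken from the study's data or code.

```python
def mix_voxel(component_timecourses, map_weights):
    """Return a voxel time course as the weighted sum of component time courses."""
    n_timepoints = len(next(iter(component_timecourses.values())))
    voxel = [0.0] * n_timepoints
    for name, timecourse in component_timecourses.items():
        weight = map_weights.get(name, 0.0)  # map value of this component at the voxel
        for t, value in enumerate(timecourse):
            voxel[t] += weight * value
    return voxel


# Hypothetical component responses during four blocks: A, V, AV, rest.
components = {
    "auditory": [1.0, 0.0, 1.0, 0.0],
    "visual": [0.0, 1.0, 1.0, 0.0],
    "multisensory": [0.0, 0.0, 1.0, 0.0],  # active only during bimodal stimulation
}

# Voxel 1: only the auditory map has a large weight here.
voxel_1 = mix_voxel(components, {"auditory": 1.0})
# Voxel 3: both unisensory maps contribute equally (additive AV response).
voxel_3 = mix_voxel(components, {"auditory": 0.5, "visual": 0.5})
# Voxel 4: an additional purely multisensory contribution (superadditive AV response).
voxel_4 = mix_voxel(components, {"auditory": 0.5, "visual": 0.5, "multisensory": 1.0})
```

In this toy setting, the AV response of `voxel_3` equals the sum of its A and V responses, whereas the AV response of `voxel_4` exceeds that sum.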

Materials and methods


Twelve subjects (three female) participated in this study; their mean age was 28.8 years (range 21–38 years). All subjects had normal or corrected-to-normal (four subjects) vision. Of these subjects, ten participated in experiment 1 and six in experiment 2. All participants received information on MRI and a questionnaire to check for potential health risks and contraindications. Volunteers gave their written informed consent after having been introduced to the procedure in accordance with the Declaration of Helsinki.


Visual stimulation consisted of eight gray-scale common object photographs (mean stimulus size 12.8° visual angle). Each visual stimulation block consisted of eight photographs that were presented in the center of the screen at a rate of 0.5 Hz. In the center of the white screen, a black fixation cross was displayed during the entire experiment. Auditory stimulation consisted of complex sounds related to the same eight common objects. Each auditory stimulation block consisted of eight of these sounds that were presented at a rate of 0.5 Hz.


In both experiments, stimuli were presented in a block design with a block length of approximately 16 s (eight measurement volumes), separated from the next stimulation block by a fixation period of equal length. In experiment 1, we employed the following conditions: common sounds (A), common sounds played backwards (A-bw), gray-scale images of common objects (V), AV combinations that were semantically congruent (CON), and AV combinations that were semantically incongruent (INC) (see Fig. 2 for an overview). In addition to the A, V, and CON conditions of experiment 1, experiment 2 comprised two different types of semantically incongruent AV combinations consisting of auditory and visual stimuli stemming either from the same (“low incongruency,” INL) or from different object categories (“high incongruency,” INH). Both experiments consisted of two runs each. Within each run, each of the experimental conditions was repeated four times. While subjects were asked to fixate and be attentive during experiment 1, they had to perform a repetition detection task in experiment 2.
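As a minimal sketch of this block design, the following Python snippet builds one box-car predictor per condition for a single run of experiment 1. The block length in volumes, the number of repetitions, and the condition labels come from the text; the cyclic condition order and the choice to start each cycle with fixation are assumptions made only for illustration, since the actual block order is not stated here.

```python
BLOCK_VOLUMES = 8   # stimulation block length in volumes (from the text)
REPETITIONS = 4     # repetitions of each condition per run (from the text)
CONDITIONS = ["A", "A-bw", "V", "CON", "INC"]  # conditions of experiment 1


def build_boxcars(conditions, repetitions, block_volumes):
    """Return one 0/1 box-car time course per condition for a single run."""
    order = conditions * repetitions  # ASSUMPTION: fixed cyclic block order
    n_volumes = len(order) * 2 * block_volumes  # each block is paired with fixation
    boxcars = {cond: [0] * n_volumes for cond in conditions}
    for i, cond in enumerate(order):
        # ASSUMPTION: every stimulation block is preceded by a fixation period.
        start = i * 2 * block_volumes + block_volumes
        for volume in range(start, start + block_volumes):
            boxcars[cond][volume] = 1
    return boxcars


boxcars = build_boxcars(CONDITIONS, REPETITIONS, BLOCK_VOLUMES)
```

Each predictor is 1 during the blocks of its condition and 0 elsewhere, so no two predictors are active at the same volume.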

Fig. 2

Experimental conditions. We employed the following experimental conditions: unimodal auditory (yellow), unimodal visual (blue), semantically congruent audio-visual (AV; light green), semantically incongruent AV stimuli (from the same semantic category; medium green), and semantically incongruent AV stimuli (from different semantic categories; dark green)


FMRI scanning was performed on a 1.5 Tesla Siemens Magnetom Vision scanner (Siemens, Erlangen, Germany) at the Institute of Neuroradiology of Frankfurt Medical School. An echo-planar-imaging (EPI) sequence was used with the following parameters: 16 slices, oriented approximately in parallel to the AC-PC plane (AC, anterior commissure; PC, posterior commissure); TR, 2081 ms; TE, 60 ms; FA, 90°; FOV, 200 mm; in-plane resolution, 3.13 × 3.13 mm²; slice thickness, 5 mm; gap thickness, 1 mm. In addition, a detailed T1-weighted anatomical scan was acquired for all subjects using a Siemens fast low-angle-shot (FLASH) sequence (isotropic voxel size 1 mm³). For each subject, an additional magnetization-prepared rapid-acquisition gradient-echo (MP-RAGE) sequence was used (TR = 9.7 ms, TE = 4 ms, FA = 12°, matrix = 256 × 256, voxel size 2.0 × 1.0 × 1.0 mm³) in each fMRI scanning session for later realignment with the detailed anatomical scan that had been measured in a separate session.

Data analysis


Data were preprocessed using the BrainVoyager™ QX (version 1.8) software package (Brain Innovation, Maastricht, The Netherlands). The first four volumes of each experimental run were discarded to preclude T1 saturation effects. Preprocessing of functional data included the following steps: (1) linear trend removal and temporal high-pass filtering at ~0.01 Hz, (2) slice-scan-time correction with sinc interpolation, (3) spatial smoothing using Gaussian kernels of 6 mm (experiment 1) and 8 mm (experiment 2), and (4) three-dimensional motion correction (only for experiment 2). The functional data were then resampled into a 3-dimensional standardized space (Talairach and Tournoux 1998) with a resampled voxel size of 3 × 3 × 3 mm³.

Hypothesis-generating functional connectivity analysis of experiment 1

Functional connectivity modes of the time series of experiment 1 were analyzed using an extension of a multi-subject data-driven analysis (sogICA, Esposito et al. 2005; van de Ven et al. 2009) in Matlab (Mathworks Inc.) using freely available toolboxes (FastICA, Hyvärinen 1999; Icasso, Himberg et al. 2004) and custom-made routines. Individual runs were decomposed using spatial ICA (McKeown et al. 1998; Calhoun et al. 2001; van de Ven et al. 2004) into 35 spatially independent components and associated activation profiles, then clustered in a data-driven hierarchical sense (over runs, then over subjects) based on similarity between component pairs, with an average group-level activation profile. See Electronic Supplementary Material for more details. Selection of target maps obtained from a data-driven analysis can be done by utilizing (a combination of) spatial (van de Ven et al. 2004, 2009; Greicius et al. 2003; Castelo-Branco et al. 2002) or temporal hypotheses (McKeown et al. 1998; Calhoun et al. 2004; Moritz et al. 2003). Spatial templates were obtained as masks in which voxels belonging to key regions were set to 1 and all other voxels to 0. Separate spatial templates were generated for visual, auditory, and posterior parietal cortex from an independent dataset (van de Ven et al. 2004). Temporal hypotheses comprised the haemodynamically convolved sequences of unimodal (visual or auditory) or bimodal experimental conditions. Clusters were selected according to maximum spatial and temporal correlations with spatial templates of the two unimodal (bilateral auditory cortex, bilateral visual cortex) and bimodal candidate regions (posterior parietal cortex). This selection procedure yielded a single unique cluster for each of the unimodal sensory modalities and a single bimodal cluster that was correlated with unimodal as well as bimodal stimulus conditions (see Electronic Supplementary Material).
We then computed intersections between these between-subject maps in order to define candidate regions for AV integration that served as ROIs for the analysis of experiment 2 (Fig. 1a, right column). For each cluster, the activation profiles of the clustered connectivity modes were averaged to obtain a group-level activation profile.
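The template-based selection and the intersection step described above can be sketched as follows. This is a simplified illustration, not the Matlab routines used in the study: it covers only the spatial part of the selection, and the toy maps, the threshold value, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def select_cluster(cluster_maps, template):
    """Return the name of the map with maximum spatial correlation to a 0/1 template."""
    return max(cluster_maps, key=lambda name: pearson(cluster_maps[name], template))


def overlap_voxels(map_a, map_b, threshold):
    """Return indices of voxels that exceed the threshold in both maps."""
    return [i for i, (a, b) in enumerate(zip(map_a, map_b))
            if a > threshold and b > threshold]


# Hypothetical group-level maps over six voxels (values play the role of t statistics).
cluster_maps = {
    "c1": [3.0, 2.8, 0.1, 0.2, 0.0, 0.1],
    "c2": [0.1, 0.0, 2.9, 3.1, 0.2, 0.1],
    "c3": [0.2, 2.9, 3.0, 0.1, 0.0, 0.1],
}
auditory_template = [1, 1, 0, 0, 0, 0]  # binary mask: 1 inside the key region

auditory_cluster = select_cluster(cluster_maps, auditory_template)
candidate_roi = overlap_voxels(cluster_maps["c1"], cluster_maps["c3"], threshold=2.5)
```

With these toy values, the map `c1` is selected as the auditory cluster, and its intersection with `c3` yields a single candidate voxel.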

Hypothesis-testing analysis of ROIs in experiment 2

For the statistical analysis of experiment 2, we employed conventional hypothesis testing using multiple linear regression of voxel time courses of the ROIs as defined in experiment 1. For every voxel, the time course was regressed on a set of dummy-coded predictors representing the five experimental conditions. To account for the shape and delay of the hemodynamic response (Boynton et al. 1996), the predictor time courses (box-car functions) were convolved with a gamma function. We used group-based conjunction analyses (a fixed effects model with separate subject predictors) on the data of experiment 2, which were spatially restricted to the ROIs obtained on the basis of experiment 1 to effectively test the potential role of these ROIs in the context of AV object perception. More specifically, we employed the so-called max-criterion (i.e., AV > max[A, V]; e.g., Beauchamp 2005b) to test for multisensory integration defined as enhanced activation during bimodal stimulation. Although most widely used in neuroimaging analyses of multisensory integration, and therefore our choice in this context, its validity for computational and psychophysical research is debated (Angelaki et al. 2009). The ROI activation profiles were visualized using bar plots of the group-based regression coefficients (beta estimates) for each experimental condition.
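Two steps of this analysis lend themselves to a short sketch: convolving a box-car predictor with a gamma function, and evaluating the max-criterion as a conjunction of the two contrasts AV > A and AV > V. The gamma parameters (`n`, `tau`), the kernel duration, the minimum-statistic form of the conjunction, and the critical value in the example are assumptions chosen for illustration; only the TR is taken from the acquisition parameters reported above.

```python
from math import exp, factorial

TR = 2.081  # repetition time in seconds (from the acquisition parameters)


def gamma_hrf(tr, n=3, tau=1.25, duration=20.0):
    """Sample a gamma-function response kernel at multiples of the TR.

    ASSUMPTION: n, tau, and duration are illustrative values, not the study's settings.
    """
    times = [i * tr for i in range(int(duration / tr) + 1)]
    return [((t / tau) ** (n - 1) * exp(-t / tau)) / (tau * factorial(n - 1))
            for t in times]


def convolve(boxcar, kernel):
    """Discrete convolution, truncated to the length of the box-car."""
    out = [0.0] * len(boxcar)
    for i, value in enumerate(boxcar):
        if value:
            for j, weight in enumerate(kernel):
                if i + j < len(out):
                    out[i + j] += value * weight
    return out


def max_criterion(t_av_gt_a, t_av_gt_v, t_crit):
    """Conjunction of AV > A and AV > V: both contrasts must exceed t_crit.

    ASSUMPTION: implemented as a minimum-statistic conjunction.
    """
    return min(t_av_gt_a, t_av_gt_v) > t_crit


# One stimulation block of eight volumes, flanked by fixation periods.
boxcar = [0] * 8 + [1] * 8 + [0] * 8
predictor = convolve(boxcar, gamma_hrf(TR))
```

The convolved predictor rises after block onset and decays after block offset. For example, `max_criterion(2.8, 3.1, t_crit=1.97)` is satisfied, whereas `max_criterion(2.8, 1.2, t_crit=1.97)` is not, because one of the two contrasts falls below the (hypothetical) critical value.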


Experiment 1 (sICA)

We detected three between-subject clusters that appeared to be involved in sensory processing (Fig. 3). Two of them reflected unisensory processing based on their respective prominent spatial coverage of unisensory cortices and associated time courses. The spatial distribution of the auditory cluster, which ranked third in the intra-cluster similarity rating (Fig. 3a; FDR-corrected visualization threshold, t = 2.74), included superior–lateral parts of the temporal lobes. The right-hand panels of Fig. 3 show the time courses of the respective IC clusters. The time course peaked on blocks of auditory as well as AV stimulation. The spatial distribution of the visual cluster, which ranked second (Fig. 3b; corrected visualization threshold t = 2.54), included bilateral occipital and posterior parietal cortices. The associated time course peaked on blocks of visual as well as AV stimulation. Finally, only one of the lower-ranked IC clusters (ranked 7th; corrected visualization threshold t = 3.44) showed a prominent spatial distribution and time course that could be associated with AV processing (Fig. 3c) and contained bilateral posterior parietal and left prefrontal cortex. The temporal associations of these three IC clusters with the experimental paradigm were further quantified by submitting the component time courses to a GLM with contrasts testing for both auditory or visual modality preference and AV integration (see Table 1 for statistical parameters). This analysis confirmed our tentative characterization of these IC clusters as auditory, visual, and AV, respectively.

Fig. 3

Independent component (IC) clusters of interest. Three IC cluster maps with activations in predominantly auditory (a), visual (b), and heteromodal (c) cortices are shown with their respective averaged time courses. Data are projected on group-averaged anatomical images according to neurological convention, with Talairach coordinates (x, y, and z) for the main cluster in view. The left hemisphere is depicted on the left of the image. Graphs in the middle show the respective component time courses against the background of the experimental conditions. Graphs on the right show the time courses averaged over blocks of the same condition (twelve time points, starting from the start of the block)

Table 1 Characterization and selection of independent components (ICs)

We then identified regions of overlap between these group-level connectivity maps (Table 2) in order to define a set of candidate ROIs potentially involved in AV integration. This resulted in a network of ROIs including bilateral superior temporal (pSTS), ventral occipito-temporal (VOT), ventro-medial occipital (VMO), posterior parietal (PPC), and prefrontal cortex (PFC) as well as left auditory (AC) and dorsal pre-motor cortex (dPMC) (Table 3).

Table 2 Experiment 1: Regions of overlap between IC cluster maps 2 (visual), 4 (auditory), and 7 (AV)
Table 3 Functional activation profiles of ROIs in experiment 2

Experiment 2 (ROI-based analysis)

The AV candidate regions defined on the basis of experiment 1 served as ROIs for the analysis of experiment 2. Applying the max-criterion for AV integration (i.e., AV > max[A, V]; e.g., Beauchamp 2005b), we revealed integrative activation profiles (Fig. 4; for statistical parameters see Table 3) for the highly incongruent AV stimulation in bilateral pSTS (left: t = 4.4, P < 0.001; right: t = 3.6, P < 0.001), VOT (left: t = 4.2, P < 0.001; right: t = 3.5, P = 0.001), VMO (left: t = 2.7, P = 0.007; right: t = 2.7, P = 0.006), and PFC (left: t = 3.6, P < 0.001; right: t = 3.5, P < 0.001) as well as in left PPC (t = 2.8, P = 0.005). Only a subset of these, namely bilateral pSTS (left: t = 5.5, P < 0.001; right: t = 3.5, P < 0.001), VOT (left: t = 2.9, P = 0.004; right: t = 2.7, P = 0.008), left VMO (t = 2.5, P = 0.012), and left PPC (t = 2.0, P = 0.049), also met the criterion during incongruent same-category stimulation. Only the left pSTS (t = 2.8, P = 0.005) and left VOT (t = 2.8, P = 0.005) ROIs met the max-criterion during each type of AV stimulation, including semantically congruent stimuli.

Fig. 4

Experiment 2: explicit statistical testing of hypothesized AV convergence regions. GLM-based group results of experiment 2 are shown for nine regions-of-interest (ROIs) as defined in experiment 1. Only ROIs are shown that met the max-criterion (i.e., AV > max[A, V]) for at least one of three AV conditions. The middle column shows the respective ROIs (colored in green) as projected on group-averaged anatomical data. The left and right columns depict the respective functional activation profiles of these ROIs by providing the GLM beta estimates for each experimental condition. Asterisks indicate the least significance level of all significant max-contrasts (*<0.05; **<0.01; ***<0.005)

GLM-based ROI definition (experiment 1) and analysis (experiment 2)

For comparison with the ICA-based ROI analysis, data of experiment 1 were also analyzed using a conventional whole-brain GLM, in which AV integration maps were computed using the max-criterion (AV > max[A, V]; t = 3.25, P < 0.05, cluster-size corrected; estimated cluster-size threshold = 281 voxels). Similar to the ICA approach, we corrected the GLM estimates for multiple comparisons using the FDR (q = 0.05). This procedure did not provide any significant results, which suggested that the ICA method had greater power in localizing candidate ROIs.

We followed up on this result by comparing the GLM and ICA methods in more detail. Direct comparison of the results of these methods is not a trivial issue because the underlying data come from different distributions (i.e., beta coefficients from time course analysis of the GLM and multivariate estimates from ICA). However, in both methods, the final statistical test is performed on the subject level, with the GLM as well as the sogICA method culminating in a t test across participants. Thus, we compared the P values of the GLM and ICA results in two situations. First, we equalized the number of visualized voxels of the GLM-estimated results to that of the ICA-based results and ascertained the visualization threshold and spatial overlap of the equalized GLM map with the ICA-based map. Spatial overlap was calculated as the proportion of overlapping voxels relative to the total number of GLM voxels. These procedures resulted in a minimum visualization threshold of the GLM map of P = 0.012, uncorrected, which showed an overlap with the ICA-based map of 4.23%. Second, we applied cluster-size correction as an alternative method for multiple comparison correction (Forman et al. 1995). This procedure yielded three voxel clusters (compared to nine ICA-based ROIs) that overlapped with the ICA-based ROIs (see Electronic Supplementary Material for further details). Thus, both post hoc comparisons between the two analysis methods showed a higher detection power for the ICA-based method.
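The spatial overlap measure used in this comparison amounts to a simple set computation. The sketch below assumes that each map is represented as a collection of suprathreshold voxel coordinates; the function name and the toy coordinates are hypothetical.

```python
def overlap_proportion(glm_voxels, ica_voxels):
    """Proportion of GLM voxels that are also part of the ICA-based map."""
    glm = set(glm_voxels)
    ica = set(ica_voxels)
    if not glm:
        raise ValueError("the GLM map contains no suprathreshold voxels")
    return len(glm & ica) / len(glm)


# Hypothetical suprathreshold voxel coordinates (x, y, z) of the two maps.
glm_map = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
ica_map = [(3, 0, 0), (4, 0, 0)]

proportion = overlap_proportion(glm_map, ica_map)  # one of four GLM voxels overlaps
```

Because the denominator is the number of GLM voxels, the measure is asymmetric: it describes how much of the GLM map falls inside the ICA-based map, not the reverse.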


Group ICA of an AV fMRI data set allowed us to define an exclusive set of cortical candidate regions for AV integration from uni- and multisensory connectivity networks. An independent follow-up experiment further confirmed AV convergence in these regions. While left pSTS and VOT regions were found to integrate auditory and visual stimuli largely irrespective of their particular semantic relationship, PPC and PFC regions showed a parametric sensitivity to semantically incongruent AV stimuli. We thus showed and validated sensory convergence in functional networks of uni- and multisensory brain regions. In the following paragraphs, we first discuss these findings with regard to their potential implications for our understanding of multisensory object perception and then discuss the possible methodological implications for multisensory neuroimaging research.

The human cortical network for object-related AV convergence

While the auditory and visual connectivity maps of experiment 1 (Fig. 3a, b) showed predominantly unisensory spatial activation patterns at least at a general level, both also included cortical regions belonging to ‘unisensory’ cortices traditionally designated to the processing of the other sensory modality. This might contribute to multisensory interactions observed at lower levels of the cortical processing hierarchy that have been reported based on a variety of methodologies ranging from invasive electrophysiology in non-human primates to human neuroimaging approaches enabling either temporal or spatial high-resolution measurements (Belardinelli et al. 2004; Baier et al. 2006; Martuzzi et al. 2007; Meienbrock et al. 2007; Eckert et al. 2008; see Driver and Noesselt 2008 for a recent review) and manipulation of an additive factor such as temporal correspondence (Noesselt et al. 2007).

In classical physiological studies, another criterion for multisensory integration is superadditivity, where the response to bimodal stimuli exceeds the sum of the responses to the unimodal stimuli. So far, only a few fMRI studies have obtained such an effect. The absence of such a strong effect in this study may have several reasons. This study used optimal stimuli, whereas degraded stimuli, in accordance with the inverse-effectiveness principle (Stein and Meredith 1993), can evoke stronger multisensory integration responses (see, e.g., Stevenson et al. 2007). The spatiotemporal alignment of the auditory and visual stimulation was found to be another important factor in this regard (Werner and Noppeney 2010b). Additionally, the use of an additive factor in the design may increase sensitivity to superadditive responses (Stevenson et al. 2009).
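The relation between the two criteria can be written out explicitly. The notation below is introduced only for illustration: $\beta_{A}$, $\beta_{V}$, and $\beta_{AV}$ denote the response estimates for the auditory, visual, and audio-visual conditions, respectively.

```latex
\begin{align}
\text{max-criterion:} \quad & \beta_{AV} > \max(\beta_{A}, \beta_{V}),\\
\text{superadditivity:} \quad & \beta_{AV} > \beta_{A} + \beta_{V}.
\end{align}
```

If both unimodal responses are non-negative, then $\beta_{A} + \beta_{V} \geq \max(\beta_{A}, \beta_{V})$, so superadditivity implies the max-criterion but not vice versa. In that case superadditivity is the stricter of the two tests, which is consistent with it being observed less often.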

Group ICA of experiment 1 revealed multisensory candidate regions in left VOT and pSTS that demonstrated robust AV convergence effects during experiment 2, irrespective of the exact semantic relation between the auditory and visual stimulus components. While activations of pSTS and neighboring regions belong to the most frequently reported fMRI findings regarding AV integration (Beauchamp et al. 2004; van Atteveldt et al. 2004; Hein et al. 2007; Naumer et al. 2009; Werner and Noppeney 2010a; see also Doehrmann and Naumer 2008 for a recent review), the significance of these findings has recently been questioned (Hocking and Price 2008) and the exact role of this particular region still remains under debate. Please note that our conjunction-of-contrasts approach (i.e., the max-criterion) is conceptually similar to whole-brain analyses in previous multisensory fMRI studies (e.g., Beauchamp et al. 2004; van Atteveldt et al. 2004), which first calculated the overlap of unisensory maps as a way to map candidate sites for multisensory integration and subsequently performed comparisons between bi- versus unimodal experimental conditions within these candidate regions. The findings from our GLM-based whole-brain analysis correspond to the finding of AV integration in these studies.

All three PFC and PPC ROIs appeared to be more strongly activated when the stimuli in the two modalities were semantically incongruent, with activation increasing parametrically with the level of semantic incongruency (i.e., the conceptual distance between the auditory and visual stimulus components); however, this trend was not supported by a post hoc ANOVA (F = 0.1352, P > 0.05). This suggests that the fronto-parietal network is likely concerned with higher-level (cognitive rather than perceptual) AV processing, when a certain amount of stimulus abstraction has already been achieved (van Atteveldt et al. 2004; Hein et al. 2007; Doehrmann and Naumer 2008; Naumer et al. 2009; Werner and Noppeney 2010a). An effective connectivity study (Noppeney et al. 2008) using a crossmodal priming paradigm has shed some light on the (hierarchical) roles of these congruency-sensitive integration sites, suggesting that their activation during incongruent stimulation constitutes unsuppressed input from low-level regions. Integrative regions can also be distinguished on the basis of the stimulus types that affect them, as shown in a study in which irrelevant auditory cues affected the perception and processing of visual motion stimuli (Sadaghiani et al. 2009).

How ICA-based analysis can contribute to multisensory fMRI research

We employed a two-step fMRI data analysis approach to investigate object-related AV convergence in human cerebral cortex. This approach combined hypothesis-generating ICA used to define a widely distributed set of AV candidate regions (experiment 1) and the hypothesis-testing GLM as employed to explicitly test the hypothesized sites of AV convergence using established statistical criteria (experiment 2). Even though there is a continuing debate about both the inclusion and the particular roles of diverse brain regions (Calvert 2001; Beauchamp 2005b; Hocking and Price 2008; Stevenson et al. 2009), there is a growing consensus that object-related multisensory integration critically involves distributed processing, presumably within a multi-level hierarchy of brain regions (Amedi et al. 2005; Doehrmann and Naumer 2008; Driver and Noesselt 2008; Naumer and Kaiser 2010). The use of sICA appears to be of particular value for human multisensory research, as it provides a robust non-invasive measure of neural coactivation. The use of an IC grouping method, such as the hierarchical clustering method applied here, does not only facilitate the generalization to the population level but also precludes the potential effects of local minima in ICA (Himberg et al. 2004). However, as sICA is a data-driven approach, which can be used for the generation of specific hypotheses (Castelo-Branco et al. 2002), it is recommended to complement it by explicit statistical hypothesis testing based on independent data. Interestingly, increased attention to data-driven methods such as sICA has already been given in the context of complex and ecologically valid environmental perception (van de Ven et al. 2004, 2008, 2009; Bartels and Zeki 2005; Esposito et al. 2005; Malinen et al. 2007) of which multisensory object perception can be regarded as another prominent example. 
In addition, the clustering approach of single-subject decompositions within the sogICA framework essentially provides a random effects approach that is similar to its GLM-based counterpart and allowed us to compare their detection power in our study. Thus, we are confident that independent statistical testing of hypotheses generated using sICA might provide important results for the debate on rival models of multisensory integration in the human brain.

We demonstrated that sICA is able to effectively reveal a comprehensive ensemble of candidate regions for AV convergence. These are less likely to be detected in whole-brain GLM contrasts (e.g., AV > max[A, V]) such as the one we computed and reported here for comparison (see Electronic Supplementary Material for details). An attempt to compare our two-step method directly with a classical whole-brain GLM approach resulted in a lack of results for the latter when using the similar correction criterion of FDR, and a disadvantage in detection (three sites as compared to nine) and specificity (only one of the regions detected by the whole-brain GLM showed a significant ROI-based integration effect) when using the more liberal threshold. While multivariate sICA also allows the detection and removal of typical fMRI-related artifacts (Thomas et al. 2002; Liao et al. 2006; see also Electronic Supplementary Material), its increased sensitivity in the detection of functionally coupled multisensory networks is mainly due to the fact that this method makes implicit use of functional connectivity information in the data via its one-time-course-per-map constraint.

Potential limitations and future directions

This study aimed at comprehensively revealing the human cortical network involved in object-related AV integration. As the experimentally manipulated dimension of integration—semantic congruency—could not be directly compared to multisensory convergence based on spatio-temporal proximity, we were not able to differentiate further between diverse hierarchical levels of multisensory convergence. In order to achieve a more precise functional characterization of the reported clusters in unisensory cortices, future studies should include topographic (i.e., tonotopic and retinotopic) mappings. Finally, measurements of effective connectivity, as provided by methods such as dynamic causal modelling (DCM; Friston et al. 2003; Werner and Noppeney 2010a) should enable the determination of interdependencies between the diverse components of the described cortical network.


The combination of hypothesis-generating group ICA and hypothesis-testing ROI-based GLM analysis of fMRI data allowed us to reveal the distributed cortical network of multisensory convergence regions involved in human AV object perception. Our findings support the assumption of a coordinated interplay between lower- and higher-level cortical regions specialized for distinct sub-processes of human AV object perception and demonstrate how sICA can be fruitfully applied in multisensory neuroimaging research.