Brain Imaging and Behavior

Volume 11, Issue 1, pp 253–263

Decoding power-spectral profiles from FMRI brain activities during naturalistic auditory experience

Original Research

Abstract

Recent studies have demonstrated a close relationship between computational acoustic features and neural brain activities, and have largely advanced our understanding of auditory information processing in the human brain. Along this line, we proposed a multidisciplinary study to examine whether power spectral density (PSD) profiles can be decoded from brain activities during naturalistic auditory experience. The study was performed on a high-resolution functional magnetic resonance imaging (fMRI) dataset acquired while participants freely listened to the audio-description of the movie “Forrest Gump”. Representative PSD profiles in the audio-movie were identified by clustering the audio samples according to their PSD descriptors. Support vector machine (SVM) classifiers were trained to differentiate the representative PSD profiles using the corresponding fMRI brain activities. Based on PSD profile decoding, we explored how the neural decodability correlated with power intensity and frequency deviants. Our experimental results demonstrated that PSD profiles can be reliably decoded from brain activities. We also suggest a sigmoidal relationship between the neural decodability and the power intensity deviants of PSD profiles. In addition, our study substantiates the feasibility and advantages of the naturalistic paradigm for studying the neural encoding of complex auditory information.

Keywords

Power-spectral profile · fMRI brain decoding · Auditory intensity-encoding · Frequency-encoding · Naturalistic paradigm

Introduction

Exploring neural dynamics during real-world experience has gained increasing interest in recent years (Hasson and Honey 2012; Spiers and Maguire 2007; Hasson et al. 2004; Nishimoto et al. 2011; Alluri et al. 2012; Huth et al. 2012; Nardo et al. 2011; Bartels and Zeki 2005; Bordier et al. 2013). It has been argued that “the ecological validity of any finding discovered within a controlled laboratory setup is not clear until tested in real-life contexts” (Hasson and Honey 2012). However, complex naturalistic stimuli introduce challenges in quantitative modeling of the inputs, and consequently make it difficult to infer brain-behavior relationships using traditional hypothesis-based methods (e.g., the general linear model, GLM) (Bordier et al. 2013). One promising solution to this challenge is to borrow perceptually relevant acoustic (Alluri et al. 2012; Cong et al. 2013; Toiviainen et al. 2013; Santoro et al. 2014) and visual descriptors (Nardo et al. 2011; Bordier et al. 2013; Nishimoto et al. 2011; Hu et al. 2015) that are widely used in the audio and video processing communities. These descriptors provide a practical mapping from the input space to a feature space (Naselaris et al. 2011).

Recent neuroimaging studies have demonstrated a close relationship between computational acoustic descriptors and brain activities during naturalistic auditory experience. For example, Alluri et al. correlated acoustic features related to timbral, rhythmical and tonal properties of music with fMRI brain activities, and revealed the large-scale brain circuitry dedicated to acoustic feature processing (Alluri et al. 2012). Toiviainen et al. reported that timbral and rhythmic features, but not tonal features, can be accurately decoded from fMRI brain activities in the auditory cortex, cerebellum and hippocampus (Toiviainen et al. 2013). Alluri et al. showed that the temporal evolution of brain activities can be predicted from musical features across a wide range of cortical regions (Alluri et al. 2013). Other studies have also shown that it is possible to decode acoustic categories and auditory saliency from fMRI brain activities (Klein and Zatorre 2015; Kumar et al. 2014; Fang et al. 2015; Ji et al. 2015; Zhao et al. 2014; Han et al. 2015).

Surprisingly, power spectral density (PSD), one of the most basic descriptors in acoustic information processing, has rarely been explored with respect to how it correlates with brain activities. PSD depicts the power intensity distribution of a temporal signal in the frequency domain (Proakis and Manolakis 1992). We therefore expect that correlating PSD profiles with brain activities during naturalistic experience may provide novel insights into auditory intensity-encoding and frequency-encoding in the human brain. For example, the naturalistic paradigm used in this study is free from specific task instructions and demands, and consequently may eliminate the discrepancies in neural auditory intensity-encoding caused by differences in task designs, such as comparison versus categorization of auditory intensity (Angenstein and Brechmann 2015). In addition, the naturalistic paradigm contains enriched stimulus types, such as male narrations, musical and environmental background, and dialogs among characters of different genders and ages. If brain responses are capable of differentiating PSD profiles across multiple stimulus types, the identified discriminative brain responses are unbiased by stimulus type. Thus, the naturalistic paradigm offers the possibility of alleviating the discrepancies caused by differences in stimulus types (e.g., frequency modulated tones versus speech) (Angenstein and Brechmann 2015; Reiterer et al. 2008).

The purpose of this study is to examine whether PSD profiles can be decoded from fMRI brain activities during naturalistic auditory experience, and to explore how the neural decodability quantitatively correlates with intensity and frequency differences of the PSD profiles. We adopted a public high-resolution fMRI dataset acquired while the participants freely listened to the audio-description of the movie “Forrest Gump” (Hanke et al. 2014). The representative PSD profiles in the audio-movie were identified by clustering the audio samples according to their PSD descriptors. SVM classifiers were trained to decode the PSD profiles using the corresponding brain activities. We quantitatively measured the neural decodability of PSD profiles by the number of discriminative voxels and the decoding accuracy. Our experimental results showed that PSD profiles can be reliably decoded from fMRI brain activities. We also suggest a sigmoidal relationship between the neural decodability and the power intensity difference of PSD profiles.

Materials and methods

Dataset and preprocessing

The dataset used in this study is available at http://studyforrest.org/ (Hanke et al. 2014). We briefly describe it as follows. Twenty right-handed native German speakers (age 21–38 years, mean age 26.6 years, 12 male) were recruited. All of them reported normal hearing and no history of neurological disorders. The German audio-description of the movie “Forrest Gump” was used as the naturalistic paradigm for the fMRI scan. The audio-description is specifically designed for visually impaired audiences; its content is largely identical to the German soundtrack of the movie, except for interspersed male narrations that describe the visual content of a scene when there is no relevant audio content in the movie. The audio stream was processed by a series of filters (Hanke et al. 2014) to achieve optimal saliency of the audio stimulus with regard to the MRI scanner noise.

The audio-description was divided into 8 segments, and fMRI data were acquired in two separate sessions of 4 runs each. In each run, one of the segments was presented to the participants through MRI-compatible in-ear headphones. T2*-weighted gradient-echo EPI images were acquired on a 7 Tesla Siemens MAGNETOM scanner with a 32-channel brain receive coil. Axial slices were oriented to include the ventral portions of frontal and occipital cortex; 36 slices were recorded. The scanning parameters were as follows: repetition time (TR) = 2000 ms, echo time (TE) = 22 ms, echo spacing = 0.78 ms, isotropic voxel size 1.4 mm, field-of-view (FOV) = 224 mm. In total, 3599 volumes were recorded for each participant over the 8 runs (451, 441, 438, 488, 462, 439, 542, and 338 volumes for runs 1–8, respectively). High-resolution T1- and T2-weighted structural images were acquired for structural reference using a 3D turbo spin echo (TSE) sequence with the following parameters: TR = 2500 ms and TE = 5.7 ms for the T1-weighted image, TR = 2500 ms and TE = 230 ms for the T2-weighted image, FOV = 191.8 × 256 × 256 mm, 274 sagittal slices, in-plane matrix size 384 × 384. FMRI data are available for 18 of the 20 participants.

The released dataset had been skull-stripped, and motion- and distortion-corrected. A group-specific EPI template was included, and all fMRI data were aligned to it: for each run, the average volume was computed and aligned to the group-specific EPI template through an affine transformation estimated by FSL FLIRT (Jenkinson et al. 2012), combined with a non-linear warping estimated by FSL FNIRT (Hanke et al. 2014). We performed additional pre-processing, including slice-timing correction and high-pass filtering with a 128 s cut-off. No spatial smoothing was applied. The fMRI signal of each voxel was z-scored in each run separately.
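As an illustration of these additional pre-processing steps, the following minimal sketch applies a 128 s high-pass filter and per-run z-scoring to one run of BOLD data. The Butterworth filter and all variable names are our assumptions for illustration, not the exact implementation used in the study.

```python
import numpy as np
from scipy.signal import butter, filtfilt

TR = 2.0                 # repetition time (s)
CUTOFF_HZ = 1.0 / 128.0  # 128 s high-pass cut-off

def preprocess_run(bold):
    """High-pass filter and z-score one run; bold: (n_volumes, n_voxels)."""
    # Second-order Butterworth high-pass (an assumed filter choice).
    b, a = butter(2, CUTOFF_HZ / (0.5 / TR), btype="highpass")
    filtered = filtfilt(b, a, bold, axis=0)
    # z-score each voxel's time course within the run.
    sd = filtered.std(axis=0)
    return (filtered - filtered.mean(axis=0)) / np.where(sd > 0, sd, 1.0)
```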

Welch’s PSD descriptor

We introduce the Welch PSD descriptor to model the power spectral profiles of the complex naturalistic auditory stimuli, serving as a practical mapping from the input space to the feature space for fMRI brain decoding. In the literature, both parametric and nonparametric methods for power spectral estimation (PSE) have been proposed; nonparametric methods do not make any assumptions about the data (Proakis and Manolakis 1992). Since naturalistic audio signals are typically modeled as random processes, we used the Welch PSE (Welch 1967), one of the most popular nonparametric PSE methods. In Welch PSE, the signal is divided into K segments, and allowing the segments to overlap alleviates the tradeoff between variance and spectral resolution. The classical periodogram (Proakis and Manolakis 1992) is then applied to each of the K segments separately, and the final power spectrum is the average of the K estimated power spectra. Welch PSE has a few parameters, including the shape and length of the window and the overlap rate between windows. We used the common choice of a Hamming window, a window length of 64 samples and an overlap rate of 50 %. A 129-dimensional PSD descriptor was estimated for each 2 s audio sample (corresponding to one fMRI volume).
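A minimal sketch of this descriptor using SciPy's Welch estimator is given below. The Hamming window, 64-sample window length and 50 % overlap follow the text; the FFT length of 256 is our assumption, chosen so that the one-sided spectrum has the reported 129 bins (nfft/2 + 1). The dB conversion is likewise an assumed convention.

```python
import numpy as np
from scipy.signal import welch

def psd_descriptor(clip, fs):
    """PSD descriptor of one 2 s audio clip (one fMRI volume)."""
    freqs, psd = welch(clip, fs=fs, window="hamming",
                       nperseg=64, noverlap=32, nfft=256)
    # 129-dimensional descriptor on a dB scale (assumed convention).
    return freqs, 10.0 * np.log10(psd + 1e-12)
```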

Identification of representative PSD profiles

To identify representative PSD profiles in the stimuli, the PSD descriptors of all the audio samples were grouped into clusters using the affinity propagation (AP) algorithm (Frey and Dueck 2007). AP takes a pair-wise similarity matrix as input and determines the number of clusters automatically. To increase the homogeneity of the resulting PSD clusters and to identify clusters with distinguishable patterns, we adopted a two-stage clustering strategy, as sketched below. In the first stage, AP clustering with a large cluster number was performed. Afterwards, we calculated the average intra-cluster distance of each cluster as the mean of the pair-wise Euclidean distances over all its instance-pairs. The average intra-cluster distances were then z-scored, and clusters with a z-score above 1.65 were discarded as noise. The centers of the preserved clusters were subjected to the second-stage AP clustering. After the two-stage AP clustering, a cluster label was assigned to each audio sample to represent its PSD profile.
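The sketch below outlines the two-stage strategy with scikit-learn's AffinityPropagation, assuming the PSD descriptors are stacked in a matrix X (n_samples × 129); the default preference (median similarity) and the 1.65 z-score noise threshold follow the text, while everything else is illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import AffinityPropagation

def mean_intra_dist(X, labels):
    """Mean pair-wise Euclidean distance within each cluster."""
    return np.array([pdist(X[labels == k]).mean() if np.sum(labels == k) > 1
                     else 0.0 for k in np.unique(labels)])

# Stage 1: cluster all PSD descriptors (default preference = median similarity).
ap1 = AffinityPropagation().fit(X)
intra = mean_intra_dist(X, ap1.labels_)
z = (intra - intra.mean()) / intra.std()
kept = ap1.cluster_centers_[z <= 1.65]   # discard noisy clusters (z > 1.65)

# Stage 2: re-cluster the centers of the preserved clusters.
ap2 = AffinityPropagation().fit(kept)
```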

PSD profile decoding

To examine whether PSD profiles were decodable and to quantitatively explore how their neural decodability correlated with power intensity and frequency differences, we designed binary classification tasks on paired clusters. In each binary classification task, a classifier based on the SVM (Cortes and Vapnik 1995) was trained to differentiate the PSD profiles of a cluster-pair using the corresponding brain activities (fMRI volumes). The hemodynamic response function (HRF) introduces a lag between the stimulus and the measured fMRI signal; a typical HRF peaks at 5 s and has its undershoot at 15 s. To compensate for this lag, we temporally shifted the time course of the PSD profile labels 2 TRs (4 s) behind (Nishimoto et al. 2011), as in the sketch below.
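In code, this lag compensation amounts to pairing each fMRI volume with the PSD label of the audio sample presented 2 TRs earlier; array names in the sketch are illustrative assumptions.

```python
LAG = 2  # TRs (4 s at TR = 2 s)

def align_labels_to_bold(labels, bold):
    """labels: (n_volumes,) PSD cluster labels; bold: (n_volumes, n_voxels)."""
    # Volume at time t is paired with the label of the stimulus at t - LAG.
    return labels[:-LAG], bold[LAG:]
```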

Voxel selection is essential in fMRI decoding due to the high spatial resolution of the fMRI data (Norman et al. 2006; Naselaris et al. 2011). We adopted a two-stage voxel selection strategy. In the first stage, voxels were filtered by inter-subject correlation (ISC) to identify consistent brain activities across participants (Hasson et al. 2004; Toiviainen et al. 2013). For each voxel, the ISC was calculated as the mean Pearson correlation coefficient between the fMRI signals of all possible participant-pairs, and a significance threshold (p < 0.05, FDR corrected) was inferred via a permutation test (Kauppi et al. 2014). The ISC map and significance threshold were calculated separately for each of the 8 runs and then averaged over the runs. Voxels with significant ISC were preserved and subjected to the second-stage voxel selection, in which the fMRI time courses of each voxel were averaged over the 18 participants and supervised voxel selection based on a two-sample t-test was performed to identify discriminative (p < 0.05, FDR corrected) voxels for each binary classification task. The number of discriminative voxels was used as one quantitative measure of the neural decodability.
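A minimal sketch of the voxel-wise ISC computation follows, assuming the per-run z-scored BOLD data of all participants are stacked in one array; the permutation-based significance threshold (Kauppi et al. 2014) is omitted for brevity.

```python
import numpy as np
from itertools import combinations

def voxel_isc(data):
    """data: (n_subjects, n_volumes, n_voxels), z-scored within the run."""
    n_sub = data.shape[0]
    isc = np.zeros(data.shape[2])
    for i, j in combinations(range(n_sub), 2):
        # For z-scored signals, the Pearson r per voxel is the mean
        # product of the paired time courses.
        isc += (data[i] * data[j]).mean(axis=0)
    return isc / (n_sub * (n_sub - 1) / 2)
```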

An SVM classifier with a linear kernel was trained to differentiate the labels of a paired PSD cluster using the selected brain activities. We performed a principal component analysis (PCA) to remove redundancy in the selected brain activities and to further reduce the data dimension. To avoid the ‘peeking’ problem in brain decoding (Naselaris et al. 2011), n-fold cross-validation was adopted: in each fold, the supervised t-test voxel selection, PCA-based redundancy removal and classifier training were performed on the n-1 training folds, and the classifier was tested on the remaining fold. The overall decoding accuracy was measured as the average classification accuracy across the cross-validation folds, and used as a second quantitative measure of the neural decodability.
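The sketch below illustrates one cross-validation fold of this pipeline with scikit-learn: t-test voxel selection, PCA and a linear SVM are all fitted on the training fold only. The FDR correction is omitted and the retained-variance threshold is our assumption.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def decode_fold(X_tr, y_tr, X_te, y_te, alpha=0.05):
    """X: (n_samples, n_voxels) group-averaged BOLD; y: binary labels."""
    # Supervised voxel selection on the training fold only (no 'peeking');
    # FDR correction of the p-values is omitted in this sketch.
    _, p = ttest_ind(X_tr[y_tr == 0], X_tr[y_tr == 1], axis=0)
    sel = p < alpha
    # PCA removes redundancy; keep components explaining 95 % variance.
    pca = PCA(n_components=0.95).fit(X_tr[:, sel])
    clf = SVC(kernel="linear").fit(pca.transform(X_tr[:, sel]), y_tr)
    return clf.score(pca.transform(X_te[:, sel]), y_te)
```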

Results

Exemplar PSD descriptors

We present some exemplar PSD descriptors to illustrate the possibility of identifying representative PSD profiles by PSD descriptor clustering. Figure 1 shows three exemplar audio segments (Fig. 1a) and the corresponding PSD descriptors (Fig. 1b). The audio samples correspond to male narration, piano background and child Gump speaking, respectively, and their PSD descriptors are quite different. The male narration sample has the highest overall power intensity of the three, with power concentrated at lower frequencies. The frequency band of the piano background sample is broader than that of the male narration sample, but its overall power is much lower. The child Gump speaking sample has moderate power intensity and the broadest frequency band. These differences provide discriminative information for PSD descriptor clustering.
Fig. 1

a Exemplar audio samples about male narrations (top), piano background (middle) and child Gump speaking (bottom). b The corresponding PSD descriptors

Representative PSD profiles

In our study, the representative PSD profiles were identified by a two-stage AP clustering of the audio samples according to their PSD descriptors. The AP clustering algorithm has only one parameter, the preference ‘p’: a real-valued vector in which p(i) indicates the preference that sample i be chosen as an exemplar. Following common practice, we used the default p, the median of the input sample-wise similarity matrix. The first stage of AP clustering resulted in 147 clusters, whose pair-wise distance matrix is shown in Fig. 2(a). While the majority of the clusters are homogeneous (low intra-cluster distance; dark blue and blue regions along the diagonal in Fig. 2a), relatively high intra-cluster distances are observed in some small clusters (the thin red ribbons). Twelve of the 147 clusters had a z-scored intra-cluster distance above 1.65; they were treated as outliers and removed. The center instances of the preserved 135 clusters were subjected to the second-stage AP clustering, resulting in 15 clusters. Both the distance matrix (Fig. 2b) and the PSD patterns (Fig. 2c) show that AP clustering achieved distinguishable PSD profiles with good homogeneity, except for clusters No. 13, 14 and 15. Those three clusters were discarded, and we focused on the remaining 3021 samples in 12 clusters in the following decoding study. Both power intensity differences and frequency distribution differences were observed among the 12 clusters. We therefore divided them into 3 groups according to their frequency distributions, such that the clusters in each group have similar frequency distributions but different power intensity levels, as shown in Fig. 2(d-f).
Fig. 2

The identification of representative PSD profiles based on two-stage AP clustering. a The pair-wise distance matrix after the first-stage AP clustering. b The pair-wise distance matrix after the second-stage AP clustering. c The final 15 PSD clusters. In each sub-figure, the colored lines are PSD samples, the thick black line is the cluster center, and the number in brackets is the number of samples. d-f The 12 selected clusters were divided into 3 groups according to their frequency distributions

ISC-based brain activity selection

We used inter-subject correlation (ISC) (Kauppi et al. 2014) to identify consistent brain activities across participants. Figure 3 shows the ISC map. Several brain regions showed significant (p < 0.05, FDR corrected) ISC, including bilateral STG (superior temporal gyrus), MTG (middle temporal gyrus), HG (Heschl’s gyrus), Broca’s area, and the default mode network (DMN). This observation is in line with typical ISC maps obtained during naturalistic audio listening in previous studies (Trost et al. 2015; Farbood et al. 2015; Abrams et al. 2013). In our study, 53,881 voxels with significant ISC were preserved and subjected to the supervised, t-test-based identification of discriminative voxels.
Fig. 3

Brain areas with significant (p < 0.05, FDR corrected) inter-subject correlation (ISC). Arrows highlight brain areas. Blue: STG; cyan: MTG; black: HG; green: DMN; yellow: Broca’s area

Discriminative voxels and decoding accuracy

In total, 66 paired t-tests were performed to identify discriminative voxels for all the binary classification tasks among the 12 clusters. Discriminative voxels were successfully identified for 45 of the 66 binary classification tasks (Fig. 4a), demonstrating the possibility of decoding PSD profiles from brain activities. We used 6-fold cross-validation to examine the decoding accuracy, as shown in Fig. 4(b). Of the 45 decodable cluster pairs, 16 (35.56 %) achieved a decoding accuracy above 0.7. The lowest decoding accuracy was obtained for cluster 2 against cluster 8 (51.70 %), and the highest for cluster 4 against cluster 7 (90.04 %). Both the number of discriminative voxels and the decoding accuracy varied over a wide range across classification tasks. This observation provides preliminary evidence that the neural decodability of PSD profiles may be highly correlated with the underlying differences in power intensity and frequency distribution.
Fig. 4

a Number of discriminative voxels. b Decoding accuracy. When a cluster-pair is not differentiable, the decoding accuracy was set to 0.5. For better visualization, we only show the upper triangular part of the symmetric matrix

Neural decodability of PSD profiles

Both power intensity deviants and frequency distribution deviants contribute to differences in the evoked brain activities (Röhl and Uppenkamp 2012; Uppenkamp and Röhl 2013). In the 3 groups of PSD profiles shown in Fig. 2(d-f), the clusters in each group have similar frequency distributions but different power intensity levels. Thus, the neural decodability of the intra-group cluster-pairs is dominated by power intensity differences, while that of the inter-group cluster-pairs might be determined by both power intensity and frequency differences. Accordingly, we divided the 66 cluster pairs into 4 groups, denoted as Gpow, Gfreq, Gboth and Gneit. Gpow included all possible intra-group cluster-pairs. In addition, cluster 7 had an extremely low power intensity, and Fig. 4 shows that it was highly differentiable against the other clusters; the inter-group cluster-pairs involving cluster 7 were therefore merged into Gpow. Gneit included all the undecodable inter-group cluster-pairs. The decodable inter-group cluster-pairs were further divided into Gfreq and Gboth by manual inspection. For example, clusters 1 and 3 were not differentiable, and the power intensity difference between clusters 1 and 8 was smaller than that between clusters 1 and 3; consequently, the decodability of cluster 1 against cluster 8 was dominated by the frequency difference. Clusters 1 and 2 were differentiable, and the power intensity difference between clusters 1 and 9 was larger than that between clusters 1 and 2; hence the decodability of cluster 1 against cluster 9 was determined by both power intensity and frequency differences. As summarized in Table 1, there were 30, 15, 13 and 8 cluster pairs in Gpow, Gfreq, Gboth and Gneit, respectively.
Table 1

Cluster-pairs whose neural decodability was determined by power intensity difference (Gpow), frequency difference (Gfreq), and both power intensity and frequency difference (Gboth). Gneit denotes the undecodable inter-group cluster-pairs

Gpow (30): (1 2), (1 3), (1 6), (1 7), (1 10), (1 12), (2 3), (2 6), (2 7), (2 10), (2 12), (3 6), (3 7), (3 10), (3 12), (6 7), (6 10), (6 12), (10 12), (4 5), (4 7), (4 8), (5 7), (5 8), (7 8), (7 9), (7 10), (7 11), (7 12), (9 11)

Gfreq (15): (1 4), (1 8), (1 11), (2 11), (2 8), (3 4), (3 5), (3 8), (3 9), (3 11), (4 10), (4 12), (5 12), (8 12), (11 12)

Gboth (13): (1 5), (1 9), (2 4), (4 6), (4 9), (5 6), (5 10), (6 8), (6 11), (8 10), (9 10), (9 12), (10 11)

Gneit (8): (2 5), (2 9), (4 11), (5 9), (5 11), (6 9), (8 9), (8 11)

Decodability vs. power intensity difference

In this experiment, we focused on Gpow to explore the relationship between neural decodability and the power intensity difference of PSD profiles. Figure 5(a) shows the scatter plot (green circles) of the number of discriminative voxels against power intensity difference; 17 of the 30 cluster-pairs in Gpow were decodable. The scatter plot suggests a hyperbolic tangent relationship between decodability and power intensity difference, which we verified by fitting a hyperbolic tangent function y = a ⋅ tanh(kx + b) + c (the blue curve, R2 > 0.95). Figure 5(b) shows the corresponding scatter plot and curve fit (R2 > 0.97) for decoding accuracy against power intensity difference, where a similar hyperbolic tangent relationship was observed. Combining the number of discriminative voxels and the decoding accuracy, our results suggest a sigmoidal relationship between the neural decodability and the power intensity difference of PSD profiles, with a response threshold around 4 dB and a saturation effect around 12 dB: a small power intensity difference (less than 4 dB) is barely differentiable from brain activities, and as the power intensity difference increases, the neural decodability improves but saturates once the difference exceeds 12 dB. In Fig. 6, we show the spatial distribution of the discriminative voxels in Gpow to explore the brain regions that are strongly tuned to power intensity differences. The map indicates, for each voxel, the number of cluster-pairs for which it was discriminative. Consistently discriminative brain regions (red and dark red regions) include bilateral HG, STG and PT (planum temporale), indicating that these brain areas dominate the neural encoding of power intensity differences.
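The fit reported above can be reproduced with a standard non-linear least-squares routine; in the sketch below, dx (power intensity differences in dB) and acc (decoding accuracies) are assumed inputs, and the initial guess is illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

def tanh_model(x, a, k, b, c):
    return a * np.tanh(k * x + b) + c

# dx: power intensity differences (dB); acc: decoding accuracies.
params, _ = curve_fit(tanh_model, dx, acc, p0=[0.2, 0.5, -3.0, 0.7])
resid = acc - tanh_model(dx, *params)
r2 = 1.0 - np.sum(resid**2) / np.sum((acc - acc.mean())**2)
```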
Fig. 5

a The scatter plot and hyperbolic tangent function curve fitting (R2 > 0.95) of the number of discriminative voxels against power intensity difference. b The scatter plot and curve fitting (R2 > 0.97) of the decoding accuracy against power intensity difference

Fig. 6

The spatial distribution of the discriminative voxels in Gpow

Decodability vs. frequency difference

In this experiment, we focused on Gfreq to explore the relationship between neural decodability and the frequency difference of PSD profiles. Figure 7 shows the scatter plots of the number of discriminative voxels and the decoding accuracy against frequency difference. The differences in neural decodability between cluster pairs with low (124.5 Hz) and high (311.3 Hz) frequency deviants were not significant, and no obvious quantitative relationship was observed. However, the decoding accuracy in Gfreq (62.27 ± 7.41 %) was significantly (p < 0.001) lower than that in Gpow (73.74 ± 11.64 %). This observation may suggest that the neural decodability of power intensity differences is higher than that of frequency differences during naturalistic auditory experience. The spatial distribution of the discriminative voxels in Gfreq (Fig. 8) was much more scattered than that in Gpow. Relatively consistent discriminative brain regions included the temporooccipital part of MTG, the pars opercularis of IFG (part of Broca’s area), and parts of HG and PT.
Fig. 7

The scatter plot of the number of discriminative voxels (a) and decoding accuracy (b) against frequency difference

Fig. 8

The spatial distribution of the discriminative voxels in Gfreq

Decodability vs. power and frequency difference

In this experiment, we focused on Gboth to explore how power intensity and frequency differences jointly influence the neural decodability of PSD profiles. Figure 9 shows the bar plots of the number of discriminative voxels and the decoding accuracy against power and frequency difference in Gboth. We observed that, when the frequency difference was fixed, the decoding accuracy increased with the power intensity difference. The spatial distribution of the discriminative voxels in Gboth is shown in Fig. 10; consistent discriminative brain regions included the posterior part of HG and PT. These observations indicate that the power difference may dominate the neural decodability of PSD profiles when both power intensity and frequency differences are present.
Fig. 9

The bar plot of the number of discriminative voxels (a) and decoding accuracy (b) against power intensity and frequency difference

Fig. 10

The spatial distribution of the discriminative voxels in Gboth

Discussion and conclusions

In this paper, we have proposed a multidisciplinary study to examine whether PSD profiles, one of the most basic computational acoustic features, are decodable from fMRI brain activities recorded during naturalistic auditory experience. The proposed method integrates PSD analysis of the naturalistic auditory stimuli, which serves as a practical mapping from the input space to the feature space, with brain decoding analysis, which is dedicated to interpreting the brain-behavior relationship. Based on the proposed method, we explored the relationship between neural decodability and the power and frequency differences of PSD profiles. Our experimental results showed that PSD profiles can be reliably decoded from fMRI brain activities. We suggested a sigmoidal relationship between the neural decodability and the power intensity difference of PSD profiles. We also observed the dominance of STG, HG and PT in the neural encoding of power intensity differences during naturalistic auditory experience.

The dominance of STG, HG and PT in the neural encoding of auditory intensity differences observed in our study is in line with existing fMRI studies based on task-paradigms (Angenstein and Brechmann 2015; Lasota et al. 2003; Mustovic et al. 2003; Opitz et al. 2002; Reiterer et al. 2008; Dykstra et al. 2012). Across various stimulus types (e.g., tone complexes, frequency modulated tones and unmodulated harmonic complexes) and task designs (e.g., intensity comparison and intensity categorization) (Angenstein and Brechmann 2015; Lasota et al. 2003; Mustovic et al. 2003; Opitz et al. 2002; Reiterer et al. 2008; Dykstra et al. 2012), it has been concluded that the tuning of HG, STG and PT to intensity deviations is irrespective of stimulus or task type (Angenstein and Brechmann 2015). In comparison, we used naturalistic auditory stimuli, which contain enriched stimulus types and explicitly eliminate any specific task demand, thereby providing an experimental environment with improved ecological validity.

A quantitative relationship between the neural decodability and power intensity difference has not previously been documented in the literature. We suggested a sigmoidal relationship with a response threshold around 4 dB and a saturation effect around 12 dB. This observation is partly supported by previous studies on sound intensity processing: the saturation of the activation volume at the highest auditory intensity levels (Lockwood et al. 1999; Mohr et al. 1999; Bilecen et al. 2002) and the steep rise of the activation volume between 0 and 10 dB (Langers et al. 2007) have been reported, and several studies have described a non-linear relationship between the volume of activation and sound intensity levels (Uppenkamp and Röhl 2013; Röhl and Uppenkamp 2012). In addition, the sigmoidal relationship revealed in our study may indicate that deep neural networks (Hinton 2002) with sigmoid activation functions could potentially improve the decoding performance.

No obvious relationship between neural decodability and the frequency difference of the PSD profiles was observed in our study. The tonotopic organization of the human auditory cortex has been well documented (Talavage et al. 2004): neighbouring frequencies are represented in topologically neighbouring regions of the auditory cortex. Thus, frequency differences presumably result in brain activities in different brain regions, rather than in different activation volumes or fMRI signal changes. The observation may also be attributable to the limited range of effective frequencies in the stimuli used in our study. The PSD profiles in Fig. 2(c) show that the maximum effective frequency of the audio samples was around 2000 Hz, and the maximum frequency difference among PSD profiles was about 311.3 Hz, whereas in typical tonotopicity studies the frequency of the acoustic stimuli ranges from 200 Hz to 8000 Hz (Saenz and Langers 2014). The limited number of PSD profiles with frequency differences may thus prevent a reliable description of the quantitative relationship between neural decodability and the frequency difference of PSD profiles. PSD descriptors with finer-grained spectral resolution are expected to alleviate this problem.

In conventional auditory cortex mapping based on functional brain imaging, the intensity of the stimuli is typically fixed in frequency-encoding studies, while the frequency is typically fixed in intensity-encoding studies. A comprehensive auditory cortex mapping therefore entails high cost and difficulty in brain imaging acquisition, since a large number of intensity-frequency combinations is necessary (Saenz and Langers 2014). In comparison, naturalistic audio streams come with enriched combinations of frequency and intensity. We thus envision that future studies on auditory cortex mapping can benefit remarkably from the naturalistic paradigm. In support of this, the multidisciplinary study proposed in this paper on the one hand substantiates the feasibility and advantages of the naturalistic paradigm for studying the neural encoding of complex auditory information, and on the other hand provides an analytical framework for future studies.


Compliance with ethical standards

Funding

This study was funded by National Natural Science Foundation of China (NSFC) 61103061, 61333017, 61473234 and 61522207, and the Fundamental Research Funds for the Central Universities 3102014JCQ01065.

Conflict of Interest

All co-authors have seen and agreed with the contents of the manuscript. We have no relevant conflicts of interest.

Ethical approval

This article does not contain any studies with human participants performed by any of the authors.

References

  1. Abrams, D. A., Ryali, S., Chen, T., Chordia, P., Khouzam, A., Levitin, D. J., & Menon, V. (2013). Inter-subject synchronization of brain responses during natural music listening. European Journal of Neuroscience, 37(9), 1458–1469.
  2. Alluri, V., Toiviainen, P., Jaaskelainen, I. P., Glerean, E., Sams, M., & Brattico, E. (2012). Large-scale brain networks emerge from dynamic processing of musical timbre, key and rhythm. NeuroImage, 59(4), 3677–3689.
  3. Alluri, V., Toiviainen, P., Lund, T. E., Wallentin, M., Vuust, P., Nandi, A. K., Ristaniemi, T., & Brattico, E. (2013). From Vivaldi to Beatles and back: predicting lateralized brain responses to music. NeuroImage, 83, 627–636.
  4. Angenstein, N., & Brechmann, A. (2015). Auditory intensity processing: categorization versus comparison. NeuroImage, 119, 362–370.
  5. Bartels, A., & Zeki, S. (2005). Brain dynamics during natural viewing conditions - a new guide for mapping connectivity in vivo. NeuroImage, 24(2), 339–349.
  6. Bilecen, D., Seifritz, E., Scheffler, K., Henning, J., & AC, S. (2002). Amplitopicity of the human auditory cortex: an fMRI study. NeuroImage, 17(2), 710–718.
  7. Bordier, C., Puja, F., & Macaluso, E. (2013). Sensory processing during viewing of cinematographic material: computational modeling and functional neuroimaging. NeuroImage, 67, 213–226.
  8. Cong, F., Alluri, V., Nandi, A. K., Toiviainen, P., Rui, F., Abu-Jamous, B., Gong, L., Craenen, B. G. W., Poikonen, H., & Huotilainen, M. (2013). Linking brain responses to naturalistic music through analysis of ongoing EEG and stimulus features. IEEE Transactions on Multimedia, 15(5), 1060–1069.
  9. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
  10. Dykstra, A. R., Koh, C. K., Braida, L. D., & Tramo, M. J. (2012). Dissociation of detection and discrimination of pure tones following bilateral lesions of auditory cortex. PloS One, 7(9), e44602.
  11. Fang, J., Hu, X., Han, J., Jiang, X., Zhu, D., Guo, L., & Liu, T. (2015). Data-driven analysis of functional brain interactions during free listening to music and speech. Brain Imaging and Behavior, 9(2), 162–177.
  12. Farbood, M. M., Heeger, D. J., Marcus, G., Hasson, U., & Lerner, Y. (2015). The neural processing of hierarchical structure in music and speech at different timescales. Frontiers in Neuroscience, 9, 157.
  13. Frey, B. J., & Dueck, D. (2007). Clustering by passing messages between data points. Science, 315(5814), 972–976.
  14. Han, J., Chen, C., Shao, L., Hu, X., Han, J., & Liu, T. (2015). Learning computational models of video memorability from fMRI brain imaging. IEEE Transactions on Cybernetics, 45(8), 1692–1703.
  15. Hanke, M., Baumgartner, F. J., Ibe, P., Kaule, F. R., Pollmann, S., Speck, O., Zinke, W., & Stadler, J. (2014). A high-resolution 7-tesla fMRI dataset from complex natural stimulation with an audio movie. Scientific Data, 1, 140003.
  16. Hasson, U., & Honey, C. (2012). Future trends in neuroimaging: neural processes as expressed within real-life contexts. NeuroImage, 62(2), 1272–1278.
  17. Hasson, U., Nir, Y., Levy, I., Fuhrmann, G., & Malach, R. (2004). Intersubject synchronization of cortical activity during natural vision. Science, 303(5664), 1634–1640.
  18. Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771–1800.
  19. Hu, X., Lv, C., Cheng, G., Lv, J., Guo, L., Han, J., & Liu, T. (2015). Sparsity-constrained fMRI decoding of visual saliency in naturalistic video streams. IEEE Transactions on Autonomous Mental Development, 7(2), 65–75.
  20. Huth, A. G., Nishimoto, S., Vu, A. T., & Gallant, J. L. (2012). A continuous semantic space describes the representation of thousands of object and action categories across the human brain. Neuron, 76(6), 1210–1224.
  21. Jenkinson, M., Beckmann, C. F., Behrens, T. E., Woolrich, M. W., & Smith, S. M. (2012). FSL. NeuroImage, 62(2), 782–790.
  22. Ji, X., Han, J., Jiang, X., Hu, X., Guo, L., Han, J., Shao, L., & Liu, T. (2015). Analysis of music/speech via integration of audio content and functional brain response. Information Sciences, 297, 271–282.
  23. Kauppi, J. P., Pajula, J., & Tohka, J. (2014). A versatile software package for inter-subject correlation based analyses of fMRI. Frontiers in Neuroinformatics, 8, 2.
  24. Klein, M. E., & Zatorre, R. J. (2015). Representations of invariant musical categories are decodable by pattern analysis of locally distributed BOLD responses in superior temporal and intraparietal sulci. Cerebral Cortex, 25(7), 1947–1957.
  25. Kumar, S., Bonnici, H. M., Teki, S., Agus, T. R., Pressnitzer, D., Maguire, E. A., & Griffiths, T. D. (2014). Representations of specific acoustic patterns in the auditory cortex and hippocampus. Proceedings of the Royal Society B: Biological Sciences, 281(1791), 20141000.
  26. Langers, D. R., van Dijk, P., Schoenmaker, E. S., & Backes, W. H. (2007). fMRI activation in relation to sound intensity and loudness. NeuroImage, 35(2), 709–718.
  27. Lasota, K., Ulmer, J., Firszt, J., Biswal, B., Daniels, D., & Prost, R. (2003). Intensity-dependent activation of the primary auditory cortex in functional magnetic resonance imaging. Journal of Computer Assisted Tomography, 27(2), 213–218.
  28. Lockwood, A., Salvi, R., Coad, M. L., Arnold, S. A., Wack, D., Murphy, B., & Burkard, R. (1999). The functional anatomy of the normal human auditory system: responses to 0.5 and 4.0 kHz tones at varied intensities. Cerebral Cortex, 9(1), 65–76.
  29. Mohr, C. M., King, W. M., Freeman, A. J., Briggs, R. W., & Leonard, C. M. (1999). Influence of speech stimuli intensity on the activation of auditory cortex investigated with functional magnetic resonance imaging. Journal of the Acoustical Society of America, 105(5), 2738–2745.
  30. Mustovic, H., Scheffler, K., Di Salle, F., Esposito, F., Neuhoff, J. G., Hennig, J., & Seifritz, E. (2003). Temporal integration of sequential auditory events: silent period in sound pattern activates human planum temporale. NeuroImage, 20(1), 429–434.
  31. Nardo, D., Santangelo, V., & Macaluso, E. (2011). Stimulus-driven orienting of visuo-spatial attention in complex dynamic environments. Neuron, 69(5), 1015–1028.
  32. Naselaris, T., Kay, K. N., Nishimoto, S., & Gallant, J. L. (2011). Encoding and decoding in fMRI. NeuroImage, 56(2), 400–410.
  33. Nishimoto, S., Vu, A. T., Naselaris, T., Benjamini, Y., Yu, B., & Gallant, J. L. (2011). Reconstructing visual experiences from brain activity evoked by natural movies. Current Biology, 21(19), 1641–1646.
  34. Norman, K. A., Polyn, S. M., Detre, G. J., & Haxby, J. V. (2006). Beyond mind-reading: multi-voxel pattern analysis of fMRI data. Trends in Cognitive Sciences, 10(9), 424–430.
  35. Opitz, B., Rinne, T., Mecklinger, A., Von Cramon, D. Y., & Schröger, E. (2002). Differential contribution of frontal and temporal cortices to auditory change detection: fMRI and ERP results. NeuroImage, 15(1), 167–174.
  36. Proakis, J. G., & Manolakis, D. G. (1992). Digital signal processing: Principles, algorithms, and applications. Macmillan.
  37. Reiterer, S., Erb, M., Grodd, W., & Wildgruber, D. (2008). Cerebral processing of timbre and loudness: fMRI evidence for a contribution of Broca’s area to basic auditory discrimination. Brain Imaging and Behavior, 2(1), 1–10.
  38. Röhl, M., & Uppenkamp, S. (2012). Neural coding of sound intensity and loudness in the human auditory system. Journal of the Association for Research in Otolaryngology, 13(3), 369–379.
  39. Saenz, M., & Langers, D. (2014). Tonotopic mapping of human auditory cortex. Hearing Research, 307(1), 42–52.
  40. Santoro, R., Moerel, M., De Martino, F., Goebel, R., Ugurbil, K., Yacoub, E., & Formisano, E. (2014). Encoding of natural sounds at multiple spectral and temporal resolutions in the human auditory cortex. PLoS Computational Biology, 10(1), e1003412.
  41. Spiers, H. J., & Maguire, E. A. (2007). Decoding human brain activity during real-world experiences. Trends in Cognitive Sciences, 11(8), 356–365.
  42. Talavage, T. M., Sereno, M. I., Melcher, J. R., Ledden, P. J., Rosen, B. R., & Dale, A. M. (2004). Tonotopic organization in human auditory cortex revealed by progressions of frequency sensitivity. Journal of Neurophysiology, 91(3), 1282–1296.
  43. Toiviainen, P., Alluri, V., Brattico, E., Wallentin, M., & Vuust, P. (2013). Capturing the musical brain with lasso: dynamic decoding of musical features from fMRI data. NeuroImage, 88, 170–180.
  44. Trost, W., Frühholz, S., Cochrane, T., Cojan, Y., & Vuilleumier, P. (2015). Temporal dynamics of musical emotions examined through intersubject synchrony of brain activity. Social Cognitive and Affective Neuroscience. doi:10.1093/scan/nsv060.
  45. Uppenkamp, S., & Röhl, M. (2013). Human auditory neuroimaging of intensity and loudness. Hearing Research, 307(1), 65–73.
  46. Welch, P. D. (1967). The use of fast Fourier transform for the estimation of power spectra: a method based on time averaging over short, modified periodograms. IEEE Transactions on Audio and Electroacoustics, 15(2), 70–73.
  47. Zhao, S., Jiang, X., Han, J., Hu, X., Zhu, D., Lv, J., Zhang, T., Guo, L., & Liu, T. (2014). Decoding auditory saliency from fMRI brain imaging. In Proceedings of the ACM International Conference on Multimedia, Orlando, Florida, USA.

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. School of Automation, Northwestern Polytechnical University, Xi’an, China
  2. Cortical Architecture Imaging and Discovery Lab, Department of Computer Science and Bioimaging Research Center, The University of Georgia, Athens, USA
