Decoding power-spectral profiles from FMRI brain activities during naturalistic auditory experience
Recent studies have demonstrated a close relationship between computational acoustic features and neural brain activities, and have largely advanced our understanding of auditory information processing in the human brain. Along this line, we proposed a multidisciplinary study to examine whether power spectral density (PSD) profiles can be decoded from brain activities during naturalistic auditory experience. The study was performed on a high-resolution functional magnetic resonance imaging (fMRI) dataset acquired while participants freely listened to the audio-description of the movie “Forrest Gump”. Representative PSD profiles in the audio-movie were identified by clustering the audio samples according to their PSD descriptors. Support vector machine (SVM) classifiers were trained to differentiate the representative PSD profiles using the corresponding fMRI brain activities. Based on PSD profile decoding, we explored how the neural decodability correlated with power intensity and frequency differences. Our experimental results demonstrate that PSD profiles can be reliably decoded from brain activities, and suggest a sigmoidal relationship between neural decodability and the power intensity differences of PSD profiles. Our study also substantiates the feasibility and advantages of the naturalistic paradigm for studying the neural encoding of complex auditory information.
Keywords: Power-spectral profile · fMRI brain decoding · Auditory intensity-encoding · Frequency-encoding · Naturalistic paradigm
Exploring neural dynamics during real-world experience has gained increasing interest in recent years (Hasson and Honey 2012; Spiers and Maguire 2007; Hasson et al. 2004; Nishimoto et al. 2011; Alluri et al. 2012; Huth et al. 2012; Nardo et al. 2011; Bartels and Zeki 2005; Bordier et al. 2013). It has been argued that “the ecological validity of any finding discovered within a controlled laboratory setup is not clear until tested in real-life contexts” (Hasson and Honey 2012). However, complex naturalistic stimuli introduce challenges in quantitative modeling of the inputs, and consequently make it difficult to infer brain-behavior relationships using traditional hypothesis-based methods (e.g., general linear model, GLM) (Bordier et al. 2013). One promising solution to this challenge is borrowing perceptually relevant acoustic (Alluri et al. 2012; Cong et al. 2013; Toiviainen et al. 2013; Santoro et al. 2014) and visual descriptors (Nardo et al. 2011; Bordier et al. 2013; Nishimoto et al. 2011; Hu et al. 2015) that are widely used in audio and video processing communities. It provides a practical mapping from input space to feature space (Naselaris et al. 2011).
Recent neuroimaging studies have demonstrated a close relationship between computational acoustic descriptors and brain activities during naturalistic auditory experience. For example, Alluri et al. correlated acoustic features related to timbral, rhythmical and tonal properties of music with fMRI brain activities, and showed the large-scale brain circuitry dedicated to acoustic feature processing (Alluri et al. 2012). Toiviainen et al. reported that timbral and rhythmic features, but not tonal features, can be accurately decoded from fMRI brain activities in the auditory cortex, cerebellum and hippocampus (Toiviainen et al. 2013). Alluri et al. showed that the temporal evolution of brain activities can be predicted by musical features across a wide range of cortical areas (Alluri et al. 2013). Other studies also showed that it is possible to decode acoustic categories and auditory saliency from fMRI brain activities (Klein and Zatorre 2015; Kumar et al. 2014; Fang et al. 2015; Ji et al. 2015; Zhao et al. 2014; Han et al. 2015).
Surprisingly, as one of the most basic descriptors in acoustic information processing, power spectral density (PSD) has rarely been explored in terms of how it correlates with brain activities. PSD depicts the power intensity distribution of a temporal signal in the frequency domain (Proakis and Manolakis 1992). Thus, we expect that correlating PSD profiles with brain activities during naturalistic experience may provide novel insights into auditory intensity-encoding and frequency-encoding in the human brain. For example, the naturalistic paradigm used in this study is free from any specific task instructions or demands, and consequently may eliminate the discrepancies in neural auditory intensity-encoding caused by differences in task designs, such as comparison or categorization of auditory intensity (Angenstein and Brechmann 2015). In addition, the naturalistic paradigm contains enriched stimulus types, such as male narrations, musical and environmental background, and dialogs among characters of different genders and ages. If brain responses are capable of differentiating PSD profiles across multiple stimulus types, the identified discriminative brain responses are unbiased by stimulus type. Thus, the naturalistic paradigm provides the possibility to alleviate the discrepancies caused by differences in stimulus types (e.g., frequency-modulated tones and speech) (Angenstein and Brechmann 2015; Reiterer et al. 2008).
The purpose of this study is to examine whether PSD profiles can be decoded from fMRI brain activities during naturalistic auditory experience, and to explore how the neural decodability quantitatively correlates with intensity and frequency differences of the PSD profiles. We adopted a public high-resolution fMRI dataset which was acquired while the participants freely listened to the audio-description of the movie “Forrest Gump” (Hanke et al. 2014). The representative PSD profiles in the audio-movie were identified by clustering the audio samples according to their PSD descriptors. SVM classifiers were trained to decode the PSD profiles using the corresponding brain activities. We quantitatively measured the neural decodability of PSD profiles by the volume of discriminative voxels and the decoding accuracy. Our experimental results showed that PSD profiles can be reliably decoded from fMRI brain activities. We also suggest a sigmoidal relationship between the neural decodability and the power intensity difference of PSD profiles.
Materials and methods
Dataset and preprocessing
The dataset used in this study is available at: http://studyforrest.org/ (Hanke et al. 2014). We briefly describe the dataset as follows. Twenty right-handed participants (age 21–38 years, mean age 26.6 years, 12 male) whose native language is German were recruited. All of them were reported to have normal hearing and no history of neurological disorders. The German audio-description of the movie “Forrest Gump” was used as a naturalistic paradigm for the fMRI scan. The audio-description is particularly designed for visually impaired audiences, and its content is largely identical to the German soundtrack of the movie except for interspersed male narrations which describe the visual content of a scene when there is no relevant audio-content in the movie. The audio stream was processed by a series of filters (Hanke et al. 2014) to achieve optimal saliency of the audio stimulus with regard to the MRI scanner noise.
The audio-description was divided into 8 segments. fMRI data were acquired in two separate sessions, each consisting of 4 runs. In each run, one of the segments was presented to the participants using MRI-compatible in-ear headphones. T2*-weighted gradient-echo EPI images were acquired using a 7 Tesla Siemens MAGNETOM scanner with a 32-channel brain receive coil. Axial slices were oriented to include the ventral portions of the frontal and occipital cortex; 36 slices were recorded. The scanning parameters were as follows: repetition time (TR) = 2000 ms, echo time (TE) = 22 ms, echo spacing = 0.78 ms, isotropic voxel size 1.4 mm, field-of-view (FOV) = 224 mm. In total, 3599 volumes were recorded for each participant across the 8 runs (451, 441, 438, 488, 462, 439, 542, and 338 volumes for runs 1–8, respectively). High-resolution T1- and T2-weighted structural images for structural reference were acquired using a 3D turbo spin echo (TSE) sequence with the following parameters: TR = 2500 ms and TE = 5.7 ms for the T1-weighted image, TR = 2500 ms and TE = 230 ms for the T2-weighted image, FOV = 191.8 × 256 × 256 mm, 274 sagittal slices, in-plane matrix size 384 × 384. fMRI data are available for 18 of the 20 participants.
The released dataset was skull-stripped and motion- and distortion-corrected. A group-specific EPI template was included, and all fMRI data were aligned to it. For each run, the average volume was computed and aligned to the group-specific EPI template through an affine transformation estimated by FSL FLIRT (Jenkinson et al. 2012); the affine transformation was further combined with a non-linear warping estimated by FSL FNIRT (Hanke et al. 2014). We performed additional preprocessing including slice-timing correction and high-pass filtering with a 128 s cut-off. No spatial smoothing was applied. The fMRI signal of each voxel was z-scored within each run separately.
Welch’s PSD descriptor
We introduce the Welch PSD descriptor to model the power spectral profiles of the complex naturalistic auditory stimuli, serving as a practical mapping from the input space to the feature space for fMRI brain decoding. In the literature, both parametric and nonparametric methods for power spectral estimation (PSE) have been proposed. Nonparametric methods do not make any assumptions about the data (Proakis and Manolakis 1992). Considering that naturalistic audio signals are typically modeled as random processes, we used Welch PSE (Welch 1967), one of the most popular nonparametric PSE methods. In Welch PSE, the signal is divided into K segments; by allowing the segments to overlap, it alleviates the tradeoff between variance and spectral resolution. The classical periodogram (Proakis and Manolakis 1992) is then applied to each of the K segments separately, and the final power spectrum is the average of the estimated power spectra of the K segments. Welch PSE has a few parameters, including the shape and length of the window and the overlap rate between windows. We followed the common practice of using a Hamming window; the window length was 64 and the overlap rate was 50%. A 129-dimensional PSD descriptor was estimated for each 2 s audio sample (corresponding to one fMRI volume).
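The descriptor computation above can be sketched with SciPy's Welch implementation. The sampling rate and the FFT length of 256 (needed to obtain 129 frequency bins from a 64-sample window) are our assumptions; the paper states only the window shape, length, and overlap:

```python
import numpy as np
from scipy.signal import welch

fs = 44100  # assumed audio sampling rate; not stated in the paper
rng = np.random.default_rng(0)
x = rng.standard_normal(2 * fs)  # one 2 s audio sample (one fMRI volume)

# Hamming window of length 64, 50% overlap; nfft=256 is our assumption,
# chosen so that the descriptor has 256/2 + 1 = 129 frequency bins
freqs, psd = welch(x, fs=fs, window='hamming',
                   nperseg=64, noverlap=32, nfft=256)
print(psd.shape)  # 129-dimensional PSD descriptor
```

One such descriptor per 2 s audio sample yields the feature matrix that the clustering stage operates on.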
Identification of representative PSD profiles
In order to identify representative PSD profiles in the stimuli, the PSD descriptors of all the audio samples were grouped into clusters using the affinity propagation (AP) algorithm (Frey and Dueck 2007). AP takes the pair-wise similarity matrix as input and determines the number of clusters automatically. In order to increase the homogeneity of the resulting PSD clusters and to identify PSD clusters with distinguishable patterns, we adopted a two-stage clustering strategy. In the first stage, AP clustering with a large cluster number was performed. Afterwards, for each cluster we calculated the average intra-cluster distance as the mean of the pair-wise Euclidean distances over all instance-pairs. The average intra-cluster distances were then z-scored, and clusters with a z-score above 1.65 were discarded as noise. The centers of the preserved clusters were subjected to second-stage AP clustering. After the two-stage AP clustering, a cluster label was assigned to each audio sample to represent its PSD profile.
PSD profile decoding
To examine whether PSD profiles were decodable, and to quantitatively explore how their neural decodability correlated with power intensity and frequency differences, we designed binary classification tasks on paired clusters. In each binary classification task, a classifier based on SVM (Cortes and Vapnik 1995) was trained to differentiate the PSD profiles of a cluster-pair using the corresponding brain activities (fMRI volumes). The hemodynamic response function (HRF) introduces a lag between the stimuli and the measured brain activities in fMRI data; a typical HRF peaks at about 5 s and shows an undershoot at about 15 s. To compensate for this lag, we temporally shifted the time course of the PSD profile labels 2 TRs (4 s) behind (Nishimoto et al. 2011).
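The label shift amounts to a simple realignment of the two time courses; the toy label sequence below is hypothetical:

```python
import numpy as np

TR = 2.0     # seconds per fMRI volume
lag_trs = 2  # shift of 2 TRs (4 s), roughly matching the HRF peak

labels = np.array([0, 0, 1, 1, 1, 0, 0, 1])  # hypothetical PSD labels per volume
volumes = np.arange(len(labels))             # volume indices standing in for fMRI data

# Pair volume t with the label presented at time t - lag_trs:
# drop the last lag_trs labels and the first lag_trs volumes
shifted_labels = labels[:-lag_trs]
shifted_volumes = volumes[lag_trs:]
```

After this shift, each remaining (volume, label) pair feeds into the binary classification tasks.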
Voxel selection is essential in fMRI decoding due to the high spatial resolution of the fMRI data (Norman et al. 2006; Naselaris et al. 2011). We adopted a two-stage voxel selection strategy. In the first stage, the voxels were filtered by inter-subject correlation (ISC) to identify consistent brain activities across participants (Hasson et al. 2004; Toiviainen et al. 2013). For each voxel, ISC was calculated as the mean Pearson correlation coefficient between the fMRI signals of all possible participant-pairs. A significance threshold (p < 0.05, FDR corrected) was inferred via permutation test (Kauppi et al. 2014). The ISC map and significance threshold were calculated separately for each of the 8 runs, and we then derived an average ISC map and an average threshold over the 8 runs. Voxels with significant ISC were preserved and subjected to the second stage. In the second stage, the fMRI time courses of each voxel were averaged over the 18 participants, and supervised voxel selection based on a two-sample t-test was performed to identify discriminative (p < 0.05, FDR corrected) voxels for each binary classification task. The number of discriminative voxels was used as a quantitative measure of the neural decodability.
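The first-stage ISC computation for one voxel can be illustrated as follows; the subject count and noise levels are synthetic, and the permutation-based threshold is omitted for brevity:

```python
import numpy as np
from itertools import combinations

def voxel_isc(ts):
    """ts: (n_subjects, n_timepoints) signals of one voxel.
    Returns the mean Pearson r over all subject pairs."""
    rs = [np.corrcoef(ts[i], ts[j])[0, 1]
          for i, j in combinations(range(ts.shape[0]), 2)]
    return float(np.mean(rs))

rng = np.random.default_rng(0)
shared = rng.standard_normal(100)                     # stimulus-driven signal
subj = shared + 0.5 * rng.standard_normal((18, 100))  # 18 subjects share the signal
noise = rng.standard_normal((18, 100))                # a voxel with no shared signal

print(voxel_isc(subj))   # high ISC: consistent across subjects
print(voxel_isc(noise))  # near-zero ISC: would be filtered out
```

In practice this score is computed per voxel and per run, and only voxels whose average ISC survives the permutation-derived threshold enter the second stage.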
An SVM classifier with a linear kernel was trained to differentiate the labels of each cluster-pair using the selected brain activities. We performed principal component analysis (PCA) to remove the redundancy in the selected brain activities and to further reduce the data dimension. To avoid the ‘peeking’ problem in brain decoding (Naselaris et al. 2011), n-fold cross-validation was adopted. In each fold, the supervised t-test voxel selection, PCA-based redundancy removal and classifier training were performed on the n-1 folds of training samples, and the classifier was tested on the remaining fold of testing samples. The overall decoding accuracy was measured as the average classification accuracy over the cross-validation folds, and used as another quantitative measure of the neural decodability.
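A sketch of this pipeline with scikit-learn, on synthetic data. Nesting all supervised steps inside the cross-validated pipeline is what prevents 'peeking'. Note one substitution: `SelectKBest` with the ANOVA F-score (equivalent to a two-sample t-test for two classes) stands in for the paper's FDR-corrected t-test selection, and all dimensions are illustrative:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, n)          # labels of a PSD cluster-pair
X = rng.standard_normal((n, 500))  # toy "voxel" activities
X[:, :20] += 2.0 * y[:, None]      # 20 genuinely discriminative voxels

# Feature selection, PCA, and the linear SVM are all fitted inside each
# cross-validation fold on training data only -- no 'peeking'
clf = make_pipeline(SelectKBest(f_classif, k=50),
                    PCA(n_components=10),
                    SVC(kernel='linear'))
acc = cross_val_score(clf, X, y, cv=5).mean()
print(round(acc, 2))
```

Fitting the selection and PCA steps outside the cross-validation loop would leak test-set information into the classifier and inflate the reported decoding accuracy.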
Exemplar PSD descriptors
Representative PSD profiles
ISC-based brain activity selection
Discriminative voxels and decoding accuracy
Neural decodability of PSD profiles
Cluster-pairs whose neural decodability was determined by power intensity difference (Gpow), frequency difference (Gfreq), or both power intensity and frequency difference (Gboth). Gneit denotes the undecodable inter-group cluster-pairs.
Gpow: (1 2), (1 3), (1 6), (1 7), (1 10), (1 12), (2 3), (2 6), (2 7), (2 10), (2 12), (3 6), (3 7), (3 10), (3 12), (6 7), (6 10), (6 12), (10 12), (4 5), (4 7), (4 8), (5 7), (5 8), (7 8), (7 9), (7 10), (7 11), (7 12), (9 11)
Gfreq: (1 4), (1 8), (1 11), (2 11), (2 8), (3 4), (3 5), (3 8), (3 9), (3 11), (4 10), (4 12), (5 12), (8 12), (11 12)
Gboth: (1 5), (1 9), (2 4), (4 6), (4 9), (5 6), (5 10), (6 8), (6 11), (8 10), (9 10), (9 12), (10 11)
Gneit: (2 5), (2 9), (4 11), (5 9), (5 11), (6 9), (8 9), (8 11)
Decodability vs. power intensity difference
Decodability vs. frequency difference
Decodability vs. power and frequency difference
Discussion and Conclusions
In this paper, we have proposed a multidisciplinary study to examine whether PSD profiles, which are among the most basic computational acoustic features, are decodable from fMRI brain activities recorded during naturalistic auditory experience. The proposed method integrates PSD analysis of the naturalistic auditory stimuli, which serves as a practical mapping from the input space to the feature space, and brain decoding analysis, which is dedicated to the interpretation of the brain-behavior relationship. Based on the proposed method, we have explored the relationship between neural decodability and the power and frequency differences of PSD profiles. Our experimental results showed that PSD profiles can be reliably decoded from fMRI brain activities. We suggested a sigmoidal relationship between the neural decodability and the power intensity difference of PSD profiles. We also observed the dominance of STS, HG and PT in the neural encoding of power intensity difference during naturalistic auditory experience.
The dominance of STS, HG and PT in the neural encoding of auditory intensity difference observed in our study is in line with existing fMRI studies based on task-paradigms (Angenstein and Brechmann 2015; Lasota et al. 2003; Mustovic et al. 2003; Opitz et al. 2002; Reiterer et al. 2008; Dykstra et al. 2012). Across various stimulus types (e.g., tone complexes, frequency-modulated tones and unmodulated harmonic complexes) and task designs (e.g., intensity comparison and intensity categorization) (Angenstein and Brechmann 2015; Lasota et al. 2003; Mustovic et al. 2003; Opitz et al. 2002; Reiterer et al. 2008; Dykstra et al. 2012), it was concluded that the tuning of HG, STG and PT to intensity deviations is irrespective of stimulus or task type (Angenstein and Brechmann 2015). In comparison, we used naturalistic auditory stimuli, which contain enriched stimulus types and eliminate any specific task demand, thereby providing an experimental environment with improved ecological validity.
A quantitative relationship between neural decodability and power intensity difference has not been documented in the literature. We suggest a sigmoidal relationship with a response threshold around 4 dB and a saturation effect around 12 dB. This observation is partly supported by previous studies on sound intensity processing. For example, saturation of the activation volume at the highest auditory intensity levels (Lockwood et al. 1999; Mohr et al. 1999; Bilecen et al. 2002) and a steep rise of the activation volume between 0 and 10 dB (Langers et al. 2007) have been reported. In addition, several studies have reported a non-linear relationship between the volume of activation and sound intensity levels (Uppenkamp and Röhl 2013; Röhl and Uppenkamp 2012). The sigmoidal relationship revealed in our study may further indicate that deep neural networks (Hinton 2002) with sigmoid activation functions could potentially improve the decoding performance.
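Such a sigmoidal trend can be quantified by fitting a logistic function to (intensity difference, decoding accuracy) pairs. In the sketch below, the data points are hypothetical illustrations of a ~4 dB threshold and ~12 dB saturation, not the paper's measurements:

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, amp, k, x0):
    """Accuracy rising from chance (0.5) toward 0.5 + amp, centred at x0 dB."""
    return 0.5 + amp / (1.0 + np.exp(-k * (x - x0)))

# Hypothetical (intensity-difference dB, decoding accuracy) points
db = np.array([0, 2, 4, 6, 8, 10, 12, 14, 16], dtype=float)
acc = np.array([0.51, 0.52, 0.58, 0.72, 0.84, 0.91, 0.94, 0.95, 0.95])

popt, _ = curve_fit(sigmoid, db, acc, p0=[0.45, 1.0, 7.0], maxfev=10000)
amp, k, x0 = popt
print(round(x0, 1))  # inflection point of the fitted curve, in dB
```

The fitted inflection point and slope summarize the threshold and saturation behavior in two interpretable parameters.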
No obvious relationship between neural decodability and the frequency difference of PSD profiles was observed in our study. The tonotopic organization of the human auditory cortex has been well documented (Talavage et al. 2004): neighbouring frequencies are represented in topologically neighbouring regions of the auditory cortex. Thus, frequency differences presumably result in brain activities in different brain regions, rather than in different activation volumes or fMRI signal changes. The observation may also be attributed to the limited range of effective frequencies in the stimuli used in our study. The PSD profiles in Fig. 1(c) show that the maximum effective frequency of the audio samples was around 2000 Hz, and the maximum frequency difference among PSD profiles was about 311.3 Hz, whereas in typical tonotopicity studies the frequency of the acoustic stimuli ranges from 200 Hz to 8000 Hz (Saenz and Langers 2014). The limited number of PSD profiles with frequency differences may have prevented a reliable description of the quantitative relationship between neural decodability and the frequency difference of PSD profiles. PSD descriptors with finer-grained spectral resolution are expected to alleviate this problem.
In conventional auditory cortex mapping based on functional brain imaging, the intensity of the stimuli is typically fixed in frequency-encoding studies, while the frequency is typically fixed in intensity-encoding studies. A comprehensive auditory cortex mapping therefore incurs high cost and difficulty in brain imaging acquisition, since a large number of intensity-frequency combinations is necessary (Saenz and Langers 2014). In comparison, naturalistic audio streams contain rich combinations of frequency and intensity. We thus envision that future studies on auditory cortex mapping can benefit remarkably from the naturalistic paradigm. Supportively, the multidisciplinary study proposed in this paper on the one hand substantiates the feasibility and advantages of the naturalistic paradigm for studying the neural encoding of complex auditory information, and on the other hand provides an analytical framework for future studies.
Compliance with ethical standards
This study was funded by National Natural Science Foundation of China (NSFC) 61103061, 61333017, 61473234 and 61522207, and the Fundamental Research Funds for the Central Universities 3102014JCQ01065.
Conflict of Interest
All co-authors have seen and agreed with the contents of the manuscript. We have no relevant conflicts of interest.
This article does not contain any studies with human participants performed by any of the authors.
- Bilecen D, Seifritz E, Scheffler K, Henning J, AC S (2002). Amplitopicity of the human auditory cortex: an fMRI study. NeuroImage, 17(2), 710–718.
- Cong, F., Alluri, V., Nandi, A. K., Toiviainen, P., Rui, F., Abu-Jamous, B., Gong, L., Craenen, B. G. W., Poikonen, H., & Huotilainen, M. (2013). Linking brain responses to naturalistic music through analysis of ongoing EEG and stimulus features. IEEE Transactions on Multimedia, 15(5), 1060–1069.
- Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
- Hu, X., Lv, C., Cheng, G., Lv, J., Guo, L., Han, J., & Liu, T. (2015). Sparsity-constrained fMRI decoding of visual saliency in naturalistic video streams. IEEE Transactions on Autonomous Mental Development, 7(2), 65–75.
- Proakis, J. G., & Manolakis, D. G. (1992). Digital signal processing: Principles, algorithms, and applications. Macmillan.
- Toiviainen, P., Alluri, V., Brattico, E., Wallentin, M., & Vuust, P. (2013). Capturing the musical brain with lasso: dynamic decoding of musical features from fMRI data. NeuroImage, 88, 170–180.
- Zhao, S., Jiang, X., Han, J., Hu, X., Zhu, D., Lv, J., Zhang, T., Guo, L., & Liu, T. (2014). Decoding auditory saliency from fMRI brain imaging. In Proceedings of the ACM International Conference on Multimedia, Orlando, FL, USA.