1 Introduction

As neuroimaging datasets such as functional magnetic resonance imaging (fMRI) continue to accumulate [1], many studies have applied machine learning techniques to find specific biomarkers of neurological and psychiatric disorders [2] such as schizophrenia [3, 4]. Such biomarkers also provide an opportunity to select appropriate treatments and potentially to evaluate their effectiveness. Since each neuroimaging dataset is still small compared to datasets for other machine-learning tasks, its analysis requires special procedures including hand-crafted features, feature selection, and dimensionality reduction [3,4,5].

Recent studies have reported some success in applying generative models to fMRI images [6,7,8]. Generative models classify a small dataset better than discriminative models when their assumptions are appropriate [9]. Prior knowledge and auxiliary information can be leveraged through the design of the model structure. Suk et al. [6] used hidden Markov models (HMMs) to model the temporal dynamics underlying fMRI signals. Yahata et al. [5] used sparse canonical correlation analysis (SCCA) to remove features related to known attributes of no interest (e.g., age and sex). Chen et al. [7] employed a linear model composed of a part shared by all subjects and a remaining part adjusted to express the functional topography of each subject. These models take into account individual variability but cannot generalize to an unknown attribute or subject; this generalization is a fundamental problem for diagnosing disorders [10].

On the other hand, deep neural networks (DNNs) are attracting attention as flexible machine-learning frameworks (see [11] for a review). DNNs learn high-level features of a given dataset automatically. DNNs have been used as a supervised classifier (a multilayer perceptron; MLP) [4, 12] and as an unsupervised feature extractor (an autoencoder; AE) [4, 6, 12]. Beyond these uses, DNNs called deep neural generative models (DGMs) build generative models that describe relationships between multiple factors in their network structures [13, 14]. Tashiro et al. [8] implemented the relationships between fMRI images, a class label, and scan-wise variability (signals of no interest, such as activity related to whatever the subject has in mind) on a DGM and achieved a better diagnostic accuracy than comparative models.

Given the above, we propose a deep generative model structured specifically for fMRI data analysis, called the subject-wise DGM (sw-DGM). The proposed sw-DGM takes into account individual variability (i.e., a subject-wise feature), which is shared by and inferred from all fMRI images obtained from a subject. Thanks to this inference, the proposed sw-DGM generalizes to an unknown subject, unlike the model of Chen et al. [7], and can potentially deal with unknown attributes, unlike the model of Yahata et al. [5].

We evaluate the proposed sw-DGM using resting-state fMRI (rs-fMRI) datasets of schizophrenia and bipolar disorder. Our experimental results demonstrate that the proposed sw-DGM provides a more accurate diagnosis than the conventional methods based on the functional connectivity extracted using Pearson correlation coefficients (PCCs) [3, 5] and than comparative discriminative and generative models: support vector machine (SVM) [15], long short-term memory (LSTM) [16], DGM [8], and AE+HMM [6]. In addition, the proposed sw-DGM identifies brain regions related to the disorders.

Fig. 1. Our proposed generative model composed of fMRI images \(x_{i,t}\), a diagnosis \(y_i\), a subject-wise feature \(s_i\), and scan-wise variabilities \(z_{i,t}\).

2 Subject-Wise Deep Neural Generative Model

2.1 Subject-Wise Generative Model of fMRI Images

We first propose a structured generative model of a dataset \(\mathcal D=\{\varvec{x}_i,y_i\}_{i=1}^{N}\) composed of fMRI images \(\varvec{x}_i\) and class labels \(y_i\) of N subjects indexed by i. Each subject i is a control subject (\(y_i=0\)) or has the disorder (\(y_i=1\)), and provides \(T_i\) fMRI images \(\varvec{x}_i=\{x_{i,t}\}_{t=1}^{T_i}\). We assume that each subject i has its own feature \(s_i\) following a prior distribution p(s). The subject-wise feature represents individual variability, which could be brain shape and baseline signal intensity not removed successfully by preprocessing. We also assume that each fMRI image \(x_{i,t}\) is associated with the subject’s class \(y_i\), the subject-wise feature \(s_i\), and a latent variable \(z_{i,t}\). The latent variable \(z_{i,t}\) follows a prior distribution p(z) and represents a scan-wise variability, e.g., brain activity related to something in the subject’s mind at that moment, body motion, and so on [8]. Then, we build a generative model \(p_\theta \) of fMRI images \(\varvec{x}_{i}\) conditioned by the class label \(y_i\) and parameterized by \(\theta \). This is depicted in Fig. 1 and expressed as

$$\begin{aligned} p_\theta (\varvec{x}_{i}|y_i) =\displaystyle \prod _{t=1}^{T_i} p_\theta (x_{i,t}|y_i) =\displaystyle \prod _{t=1}^{T_i} \int _{s_i} \int _{z_{i,t}} p_\theta (x_{i,t}|z_{i,t},y_i,s_i)p(z_{i,t})p(s_{i}). \end{aligned}$$

According to the variational method [13], the model evidence \(\log p_\theta (\varvec{x}_i|y_i)\) is bounded using an inference model \(q_\phi \) parameterized by \(\phi \) as

$$\begin{aligned} \begin{array}{rcl} \log p_\theta (\varvec{x}_i|y_i) &{}\ge &{}\displaystyle \mathbb E_{q_\phi (\varvec{z}_i,s_i|\varvec{x}_i,y_i)}\left[ \log \frac{p_\theta (\varvec{x}_i,\varvec{z}_i,s_i|y_i)}{q_\phi (\varvec{z}_i,s_i|\varvec{x}_i,y_i)}\right] \\ &{}=&{}-D_{KL}(q_\phi (s_i|\varvec{x}_i,y_i)||p(s_i))\\ &{}&{}-\sum \nolimits _{t=1}^{T_i}\mathbb E_{q_\phi (s_i|\varvec{x}_i,y_i)}\left[ D_{KL}(q_\phi (z_{i,t}|x_{i,t},y_i,s_i)||p(z_{i,t}))\right] \\ &{}&{}+\sum \nolimits _{t=1}^{T_i}\mathbb E_{q_\phi (s_i|\varvec{x}_i,y_i)}\left[ \mathbb E_{q_\phi (z_{i,t}|x_{i,t},y_i,s_i)}\left[ \log p_\theta (x_{i,t}|y_i,z_{i,t},s_i)\right] \right] \\ &{}=:&{}\mathcal L_g(\varvec{x}_i, y_i), \end{array} \end{aligned}$$

where \(D_{KL}(\cdot ||\cdot )\) is the Kullback-Leibler divergence and \(\mathcal L_g(\varvec{x}_i,y_i)\) is the evidence lower bound (ELBO); the ELBO is the standard objective function to be maximized with respect to the conditional generative model \(p_\theta \) and the inference model \(q_\phi \).
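The three terms of the bound can be illustrated numerically. The following is a minimal sketch, not the implementation used in this study: it assumes that the priors p(s) and p(z) are standard normal distributions (the text above does not specify them) and that the parameters of the variational posteriors and of the decoder output have already been computed for one sampled \(s_i\) and \(z_{i,t}\). All function and argument names are hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over the last axis.
    Assumption: the prior is a standard normal distribution."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)

def gaussian_log_density(x, mu, logvar):
    """log N(x; mu, diag(exp(logvar))), summed over the last axis."""
    return -0.5 * np.sum(
        np.log(2.0 * np.pi) + logvar + (x - mu) ** 2 / np.exp(logvar), axis=-1
    )

def elbo_estimate(x, s_mu, s_logvar, z_mu, z_logvar, x_mu, x_logvar):
    """Single-sample estimate of L_g for one subject and one class label.

    x:              (T, n_x) preprocessed fMRI signals
    s_mu, s_logvar: (n_s,)   parameters of q(s | x, y)
    z_mu, z_logvar: (T, n_z) parameters of q(z_t | x_t, y, s) at the sampled s
    x_mu, x_logvar: (T, n_x) decoder outputs at the sampled s and z_t
    """
    kl_s = kl_to_standard_normal(s_mu, s_logvar)            # one term per subject
    kl_z = kl_to_standard_normal(z_mu, z_logvar).sum()      # summed over scans
    recon = gaussian_log_density(x, x_mu, x_logvar).sum()   # summed over scans
    return -kl_s - kl_z + recon
```

Note that the KL term of the subject-wise feature appears once per subject, whereas the KL and reconstruction terms of the scan-wise variability are summed over the \(T_i\) scans.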

After training, the ELBO \(\mathcal L_g(\varvec{x}_i,y)\) is considered to approximate the model evidence \(\log p_\theta (\varvec{x}_i|y)\). We estimate the posterior probability \(p_\theta (y|\varvec{x}_i)\) of the class y of a subject i based on Bayes’ rule:

$$\begin{aligned} p_\theta (y|\varvec{x}_i) =\dfrac{p(y)p_\theta (\varvec{x}_i|y)}{\displaystyle \sum _{y'\in \{0,1\}}\!\!p(y')p_\theta (\varvec{x}_i|y')} \approx \dfrac{p(y)\exp \mathcal L_g(\varvec{x}_i,y)}{\displaystyle \!\sum _{y'\in \{0,1\}}\!\!p(y')\exp \!\mathcal L_g(\varvec{x}_i,y')\!} =: \exp \mathcal L_d(\varvec{x}_i,y). \end{aligned}$$

We assume the prior probability p(y) of class y to be \(p(y=0)=p(y=1)=0.5\). Hence, the larger the ELBO \(\mathcal L_g(\varvec{x}_i,y=1)\) is relative to \(\mathcal L_g(\varvec{x}_i,y=0)\), the more likely the subject i is to have the disorder.

In addition, the approximation of the log-likelihood of the class label, i.e., \(\mathcal L_d(\varvec{x}_i,y_i)\), can be an alternative objective function to be maximized, which promotes discrimination between the classes [9]. We balanced the two objective functions using the coefficient \(\omega \in [0,1]\) as

$$\begin{aligned} \mathcal L(\varvec{x}_i,y_i)=\omega \mathcal L_g(\varvec{x}_i,y_i) + (1-\omega ) \mathcal L_d(\varvec{x}_i, y_i). \end{aligned}$$
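Given the two ELBOs of a subject, one per class label, the discriminative term \(\mathcal L_d\) and the balanced objective \(\mathcal L\) follow directly. The sketch below is an illustration under the equal-prior assumption stated above; the function names are hypothetical, and the log-sum-exp stabilization is a standard numerical precaution rather than a detail taken from this study.

```python
import numpy as np

def log_class_posterior(elbo_y0, elbo_y1, y, prior_y1=0.5):
    """L_d: log of the approximate posterior p(y | x_i) computed from two ELBOs."""
    log_joint = np.array([
        np.log(1.0 - prior_y1) + elbo_y0,
        np.log(prior_y1) + elbo_y1,
    ])
    # Log-sum-exp with the maximum subtracted, because ELBOs summed over
    # many scans are large in magnitude and exp() would under- or overflow.
    m = np.max(log_joint)
    log_norm = m + np.log(np.sum(np.exp(log_joint - m)))
    return log_joint[y] - log_norm

def combined_objective(elbo_y0, elbo_y1, y, omega):
    """L = omega * L_g + (1 - omega) * L_d for the true label y."""
    l_g = elbo_y1 if y == 1 else elbo_y0
    l_d = log_class_posterior(elbo_y0, elbo_y1, y)
    return omega * l_g + (1.0 - omega) * l_d
```

For example, `log_class_posterior(-10.0, -10.0, 1)` equals \(\log 0.5\): when the two ELBOs coincide, the data do not favor either class and the posterior falls back to the prior.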
Fig. 2. Implementation of the proposed generative model on deep neural networks.

2.2 Implementation on Deep Neural Networks

We implement the generative model \(p_\theta \) and inference model \(q_\phi \) described above on deep neural networks and thereby propose a subject-wise deep generative model (sw-DGM). We treat a preprocessed fMRI signal \(x_{i,t}\), a subject-wise feature \(s_i\), and a scan-wise variability \(z_{i,t}\) as vectors of \(n_x\), \(n_s\), and \(n_z\) dimensions, respectively. The inference model \(q_\phi (z_{i,t}|x_{i,t},y_i,s_i)\) and generative model \(p_\theta (x_{i,t}|y_i, s_i, z_{i,t})\) are expressed by multivariate Gaussian distributions with diagonal covariance matrices; their parameters are the outputs of the corresponding DNNs, called the encoder and decoder (see the right two panels in Fig. 2 and the previous studies [8, 13, 14] for more detail). The implementation of the inference model \(q_\phi (s_i|\varvec{x}_i,y_i)\) requires modification because it accepts a variable-length sequence of fMRI images \(\varvec{x}_i=\{x_{i,t}\}_{t=1}^{T_i}\) obtained from a subject i. We propose a neural network architecture called the collection-encoder, which is composed of two stacked sub-networks as depicted in the leftmost panel in Fig. 2. The first sub-network accepts a preprocessed fMRI signal \(x_{i,t}\) and the class label \(y_i\), and then outputs a hidden activation \(h_{i,t}\). The second sub-network accepts the averaged hidden activation \(\bar{h}_i=\frac{1}{T_i}\sum _{t=1}^{T_i}h_{i,t}\) and outputs the variational posterior \(q_\phi (s_i|\varvec{x}_i,y_i)\) of the subject-wise feature \(s_i\).
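The averaging step is what allows a fixed-size network to accept a variable number of scans. The following forward-pass sketch illustrates this under several assumptions that are not taken from this study: the class label is fed by concatenation to the input, layer normalization is omitted for brevity, the weights are random rather than trained, and all names are hypothetical.

```python
import numpy as np

def relu_layer(h, w, b):
    """Affine map followed by ReLU (layer normalization omitted in this sketch)."""
    return np.maximum(h @ w + b, 0.0)

def init_params(n_x, n_h, n_s, seed=0):
    """Random, untrained weights; for shape illustration only."""
    rng = np.random.default_rng(seed)

    def w(n_in, n_out):
        return rng.normal(0.0, 0.1, size=(n_in, n_out))

    return {
        "w1": w(n_x + 1, n_h), "b1": np.zeros(n_h),   # first sub-network, layer 1
        "w2": w(n_h, n_h), "b2": np.zeros(n_h),       # first sub-network, layer 2
        "w_mu": w(n_h, n_s), "b_mu": np.zeros(n_s),   # second sub-network: mean
        "w_lv": w(n_h, n_s), "b_lv": np.zeros(n_s),   # second sub-network: log-variance
    }

def collection_encoder(x, y, params):
    """Map all scans of one subject, x of shape (T, n_x), to parameters of q(s | x, y)."""
    T = x.shape[0]
    # Assumption: the label is appended to every scan as one extra input unit.
    xy = np.concatenate([x, np.full((T, 1), float(y))], axis=1)
    h = relu_layer(xy, params["w1"], params["b1"])
    h = relu_layer(h, params["w2"], params["b2"])     # h_{i,t}, shape (T, n_h)
    h_bar = h.mean(axis=0)                            # average over scans
    s_mu = h_bar @ params["w_mu"] + params["b_mu"]
    s_logvar = h_bar @ params["w_lv"] + params["b_lv"]
    return s_mu, s_logvar
```

Because the scans enter only through their mean, the output has the same shape for any \(T_i\) and does not depend on the order of the scans.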

Note that the proposed sw-DGM is not equivalent to other structured DGMs such as the Skip Deep Generative Model [14]. Such models assume that each sample is generated with more than two latent variables. In contrast, the proposed sw-DGM assumes that the samples \(x_{i,t}\) obtained from the same subject i share the subject-wise feature \(s_i\) as a latent variable. This assumption potentially provides a useful constraint based on prior knowledge of fMRI images.

We used three-layered neural networks as the encoder and decoder. We used a two-layered and a single-layered neural network as the first and the second sub-networks of the collection-encoder, respectively. Each hidden layer of all the DNNs has \(n_h\) hidden units followed by layer normalization [17] and the ReLU activation function [18]. To approximate the expectations in Eq. (1), the subject-wise feature \(s_i\) and the scan-wise variability \(z_{i,t}\) were sampled from the variational posteriors \(q_\phi (s_i|\varvec{x}_i,y_i)\) and \(q_\phi (z_{i,t}|x_{i,t},y_i,s_i)\) once per sample \(x_{i,t}\) in the training phase and were replaced with the MAP estimates in the test phase, following the previous work [13]. The preprocessed fMRI signals \(x_{i,t}\) were augmented using dropout [19] with a ratio p. All the DNNs were jointly trained using the Adam optimization algorithm [20] with parameters \(\alpha =10^{-4}\), \(\beta _1=0.9\), and \(\beta _2=0.999\). We selected hyper-parameters from \(p\in \{0.0,0.5\}\), \(n_h\in \{50,100,200,400\}\), \(n_z=n_s\in \{5,10,20,50,100\}\) subject to \(n_h>n_z=n_s\), and \(\omega \in \{0.0,0.9,0.99\}\). We adjusted for the class imbalance via oversampling.
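The difference between the training and test phases in how the latent variables are obtained can be sketched as follows. This is an illustrative helper with hypothetical names, assuming the usual reparameterized sampling from a diagonal Gaussian posterior; for such a posterior, the MAP estimate coincides with the mean.

```python
import numpy as np

def draw_latent(mu, logvar, rng=None, training=True):
    """Return a latent vector from a diagonal Gaussian posterior.

    Training phase: one reparameterized sample, mu + sigma * eps.
    Test phase:     the mode of the Gaussian, which is its mean.
    """
    if not training:
        return mu
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps
```

The same helper would apply to both \(s_i\) (drawn once per subject) and \(z_{i,t}\) (drawn once per scan).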

3 Experiments and Results

3.1 Data Acquisition and Comparative Models

We used datasets obtained from the OpenfMRI database under the accession number ds000030. We preprocessed the rs-fMRI data using the SPM12 software package. We discarded the first 10 scans of each subject to ensure magnetization equilibrium. We performed time-slice adjustment, realignment of brain positions via a rigid body rotation, and spatial normalization to the MNI space with a voxel thickness of 2.0 mm. We parcellated each fMRI image into 116 regions of interest (ROIs) using the automated anatomical labeling (AAL) template [21] and averaged the intensities of the voxels in each ROI, obtaining a 116-dimensional vector as a preprocessed fMRI signal \(x_{i,t}\). As scrubbing, we discarded frames with frame displacements (FD) of more than 1.5 mm or angular rotations of more than \(1.5^{\circ }\) in any direction, as well as the following frames. We also discarded subjects who had fewer than 100 remaining frames and subjects whose fMRI images did not match the MNI template after the spatial normalization. As a result, we obtained 113 control subjects, 44 patients with schizophrenia, and 45 patients with bipolar disorder.
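The scrubbing rule can be expressed as a boolean mask over frames. The sketch below is one possible reading, not the exact procedure of this study: it assumes per-axis frame-to-frame translations and rotations as inputs, applies the thresholds to each axis separately, and drops a single frame after each flagged frame (the hypothetical parameter `n_following` controls this, since the number of following frames is not stated above).

```python
import numpy as np

def scrub_mask(translation_mm, rotation_deg, max_mm=1.5, max_deg=1.5, n_following=1):
    """Return a boolean mask of frames to keep.

    translation_mm, rotation_deg: (T, 3) frame-to-frame motion along/around each axis.
    A frame is flagged if any axis exceeds its threshold; flagged frames and the
    n_following frames after each of them are dropped (n_following is an assumption).
    """
    bad = (np.abs(translation_mm) > max_mm).any(axis=1)
    bad |= (np.abs(rotation_deg) > max_deg).any(axis=1)
    drop = bad.copy()
    for k in range(1, n_following + 1):
        drop[k:] |= bad[:-k]   # propagate each flag k frames forward
    return ~drop
```

Under the retention criterion described above, a subject would then be kept only if `scrub_mask(...).sum() >= 100`.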

As baselines, we evaluated two conventional procedures, which use Pearson’s correlation coefficients (PCCs) between the ROIs as the functional connectivities (FCs) [3, 5]. Following Shen et al. [3], we selected m features in the FCs using the Kendall \(\tau \) correlation coefficient, compressed the feature vector into a d-dimensional space using locally linear embedding (LLE) with a neighborhood parameter k, and clustered the embedded vectors into two classes using the c-means algorithm. This procedure was reported to outperform direct classification of the PCCs by the SVM and MLP [3]. Following Yahata et al. [5], we selected m features in the FCs using the SCCA and classified the features using sparse logistic regression (SLR) with a sparsity determined by automatic relevance determination (ARD). We selected the hyper-parameters from \(m\in \{50,100,200,400,600\}\), \(k\in \{5,8,10,12,15\}\), and \(d\in \{2,5,10,20,50\}\) following the original study [3].
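The FC features on which both baselines operate can be computed from the ROI time series of one subject. The following is a minimal sketch with a hypothetical function name; with 116 ROIs it yields \(116\times 115/2=6670\) unique PCCs per subject, from which the m features are then selected.

```python
import numpy as np

def functional_connectivity(x):
    """Map ROI time series x of shape (T, n_roi) to the vector of unique PCCs.

    Only the strict upper triangle of the correlation matrix is kept,
    because the matrix is symmetric and its diagonal is always 1.
    """
    corr = np.corrcoef(x, rowvar=False)          # (n_roi, n_roi)
    iu = np.triu_indices(corr.shape[0], k=1)
    return corr[iu]
```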

For comparison, we evaluated two classifiers: support vector machine (SVM) [15] and long short-term memory (LSTM) [16]. The SVM used a linear kernel, accepted a single image \(x_{i,t}\), and output a binary value representing the estimated class. The diagnosis of a subject i was determined by majority voting over the \(T_i\) estimates, consistent with the other comparative models. We selected the hyper-parameter C, which adjusts the trade-off between classification accuracy and margin maximization, from \(C\in \{\dots ,0.1,0.2,0.5,1,2,5,10,\dots \}\). The LSTM is a recurrently connected neural network; it accepted the fMRI signals \(\varvec{x}_i=\{x_{i,t}\}_{t=1}^{T_i}\) sequentially and output the posterior probability \(p(y|\varvec{x}_i)\) using the logistic function. The other conditions were the same as those for the proposed sw-DGM.
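The aggregation of scan-wise SVM outputs into a subject-wise diagnosis can be sketched as follows. The function name is hypothetical, and resolving a tie in favor of the control class is an assumption made here for definiteness; the text above does not state how ties were handled.

```python
import numpy as np

def majority_vote(scan_predictions):
    """Subject-level diagnosis from per-scan binary predictions (0 or 1).

    Returns 1 only if strictly more than half of the scans were classified
    as 1; ties therefore go to the control class (an assumption).
    """
    votes = np.asarray(scan_predictions)
    return int(2 * votes.sum() > votes.size)
```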

Table 1. Diagnostic accuracies.

We also evaluated a simpler DGM proposed in the previous study [8] and a hidden Markov model (HMM) with an autoencoder (AE) [6]. The DGM modeled the relationships between the fMRI signals \(\varvec{x}_i\), the class label \(y_i\), and the scan-wise variability \(z_{i,t}\) using an encoder \(q(z_{i,t}|x_{i,t},y_i)\) and a decoder \(p(x_{i,t}|y_i,z_{i,t})\) but did not take into account the subject-wise feature \(s_i\) [8]. Following Suk et al. [6], we compressed each fMRI image into a d-dimensional space using an AE. Then, we trained a pair of HMMs: \(p_\theta (x_{i,t}|y=1)\) for patients and \(p_\theta (x_{i,t}|y=0)\) for control subjects. Each HMM had Gaussian distributions with full covariance matrices and was trained using the Expectation-Maximization (EM) algorithm. We calculated the posterior probability \(p(y|\varvec{x}_i)\) using Bayes’ rule. We selected the number \(n_z\) of units in the bottleneck layer from \(n_z\in \{2,3\}\), the number n of mixture components of the HMM from \(n\in \{ 2,3,4,5,6,7\}\), and the hyper-parameters of the AE in the same ranges as for the proposed sw-DGM.

Table 2. Top 5 contribution weights for diagnosis.

3.2 Results of Diagnosis and Contribution Weights of ROIs

Since the datasets are imbalanced, we used the following measures: sensitivity \(\mathrm {SEN}=\mathrm {TP/(TP + FN)}\), specificity \(\mathrm {SPEC}=\mathrm {TN/(TN + FP)}\), and balanced accuracy \(\mathrm {BACC}=0.5\times (\mathrm {SEN+SPEC})\), where \(\mathrm {TP}\), \(\mathrm {TN}\), \(\mathrm {FP}\), and \(\mathrm {FN}\) denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. We performed 5 trials of 10-fold cross-validation (CV) and summarized the results in Table 1. The proposed sw-DGM achieved the best balanced accuracies among the compared approaches on both datasets. In particular, the proposed sw-DGM outperformed or at least performed no worse than the existing DGM [8], implying that the introduction of the subject-wise feature \(s_i\) (i.e., individual variability) worked as an appropriate constraint.
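The three measures can be computed from subject-level labels and predictions as in the sketch below (hypothetical function name; both classes are assumed to be present in `y_true`). It also illustrates why plain accuracy would be misleading here: a classifier that labels every subject as a control would be correct for 113 of the 157 subjects in the schizophrenia dataset, yet its balanced accuracy is only 0.5.

```python
import numpy as np

def sen_spec_bacc(y_true, y_pred):
    """Sensitivity, specificity, and balanced accuracy for binary labels (1 = patient)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    sen = tp / (tp + fn)
    spec = tn / (tn + fp)
    return sen, spec, 0.5 * (sen + spec)

# A trivial "always control" classifier on 113 controls and 44 patients.
labels = np.array([0] * 113 + [1] * 44)
always_control = np.zeros_like(labels)
trivial_sen, trivial_spec, trivial_bacc = sen_spec_bacc(labels, always_control)
```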

As shown in Eq. (2), the diagnosis of a subject i is based on the difference in the conditional log-likelihood \(\log p_\theta (\varvec{x}_i|y)\) between the class labels \(y=0\) and \(y=1\). Since each element \(x_{i,t,r}\) of an fMRI signal \(x_{i,t}\) corresponds to an ROI r, we can calculate the ROI-wise average marginal log-likelihoods \(\mathbb E_{i,t}[\log p_\theta (x_{i,t,r}|y_i)|x_{i,t}]\). An ROI with a large difference in the log-likelihoods between the correct and incorrect labels contributes strongly to an accurate diagnosis. Hence, we defined

$$\begin{aligned} W_r = \mathbb E_{i,t}\left[ \log p_\theta (x_{i,t,r}\vert y_i)-\log p_\theta (x_{i,t,r}\vert 1-y_i)\vert x_{i,t}\right] \end{aligned}$$

as the contribution weight \(W_r\) of the ROI r and summarized the ROIs with the top 5 contribution weights in Table 2. Previous studies (e.g., the review paper [22]) have discussed the relationships of some of the listed ROIs to the disorders. The results suggest that the proposed sw-DGM identified the ROIs related to the disorders.
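Once the ROI-wise log-likelihoods are available, the contribution weights reduce to an average of differences. The sketch below assumes that these log-likelihoods have already been evaluated under the correct and the incorrect label for every scan of every subject and stacked into two arrays; how they are obtained from the decoder is outside its scope, and the function names are hypothetical.

```python
import numpy as np

def contribution_weights(loglik_correct, loglik_incorrect):
    """W_r: mean over all scans of the ROI-wise log-likelihood difference.

    loglik_correct, loglik_incorrect: (n_scans_total, n_roi) arrays holding
    log p(x_{i,t,r} | y_i) and log p(x_{i,t,r} | 1 - y_i), respectively.
    """
    return np.mean(loglik_correct - loglik_incorrect, axis=0)

def top_rois(weights, k=5):
    """Indices of the k ROIs with the largest contribution weights, in descending order."""
    return np.argsort(weights)[::-1][:k]
```

A positive \(W_r\) means that, on average, the signal of ROI r is better explained under the correct label than under the incorrect one.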

4 Conclusion

This study proposed a subject-wise deep generative model (sw-DGM) of fMRI images structured specifically for diagnosing psychiatric disorders. The sw-DGM models the relationships between rs-fMRI images, the class label, individual variability, and scan-wise variability. The individual variability worked as an appropriate constraint, and the sw-DGM achieved a higher diagnostic accuracy than the conventional and comparative approaches. In addition, the sw-DGM identified brain regions related to the disorders.