Local dimension-reduced dynamical spatio-temporal models for resting state network estimation

To overcome the limitations of independent component analysis (ICA), today’s most popular analysis tool for investigating whole-brain spatial activation in resting state functional magnetic resonance imaging (fMRI), we present a new class of local dimension-reduced dynamical spatio-temporal model which dispenses the independence assumptions that severely limit deeper connectivity descriptions between spatial components. The new method combines novel concepts of group sparsity with contiguity-constrained clusterization to produce physiologically consistent regions of interest in illustrative fMRI data whose causal interactions may then be easily estimated, something impossible under the usual ICA assumptions.


Introduction
There is an ever-growing and pressing need for accurately describing how brain regions are dynamically interrelated in resting state fMRI [4]. Thanks to the nature of BOLD signals, resting state interactions cannot be split into separate space and time descriptions, especially if the focus lies on characterizing spatial changes associated to a small number of regions of interest. The chief challenge is that any dynamical spatio-temporal model (DSTM) of fMRI datasets demands many parameters to describe what is also a large number of observed variables which, nonetheless, enjoy a great deal of spatial redundancy [3,5,37]. Estimating the spatial origin of signal variability for only relatively short sample sizes using DSTMs is problematic especially under the rather usual unfavourable signal-tonoise ratio (SNR) conditions [8,24,28,34].
To circumvent limitations of modelling high-dimensionality systems, Wikle and Cressie [33] proposed dimension-reduced DSTMs aimed at capturing nonstationary spatial dependence under optimal state representations using Kalman filtering. In their formulation of DSTM, they invoke an a priori defined orthogonal basis to expand the redistribution kernel of a discrete time/continuous space, linear integro-difference equation (IDE) in terms of a finite linear combination of spatial components [33]. This idea was further supported in [14] and extended in [26] who considered parametrized redistribution kernels of arbitrary shape that meet homogeneity conditions in both space and time. Even though the base changes of [33] improve the understanding of high-dimensional processes, they by no means ensure sparse solutions which are key to achieving statistically robust dynamical descriptions.
Model robustness has alternatively been sought by indirect means as, for example, thru LASSO regression [29] and basis pursuit [6] for model selection and denoising, or sparse component analysis for blind source separation [39] and finally by iterative thresholding algorithms for image deconvolution and reconstruction [12,17]. The latter methods seek sparsity by maximizing a penalized loss function in a compromise between the goodness of fit and the number of basis elements that make up the signal. Recently, more attention has been given to group sparsity, where groups of variables are selected/shrunken simultaneously rather than individually (for a review see [2]). This is achieved by minimizing an objective function that includes a quadratic error term added to a regularization term that considers a priori beliefs or data-driven analysis to induce group sparsity [35,36,38].
The present paper extends the results in [31] about local dimension-reduced DSTMs (LDSTMs) involving statespace formulations that are suited to datasets of high dimensionality such as fMRI. LDSTMs take advantage of a sparsifying spatial wavelet transformation to represent the data thru fewer significant parameters which are then combined via sparsity and contiguity-constrained clustering to initialize the observation matrix and sources of a tailored expectation maximization (EM) algorithm. The main assumptions here are that the system is overdetermined (there exist more observed signals than sources) and that the columns of the observation matrix act as pointspreading functions (see Sect. 2). Finally, results are gauged using simulated data (Sect. 4) followed by further method illustration with directed connectivity disclosure using real fMRI resting state data.

Problem formulation
DSTM problems may be formulated as state-space models (see [9] for a comprehensive review of DSTM) where space-related measurements z t depend on the dynamical evolution of a suitably defined source vector x t through a linear gaussian model where z t is an M dimensional column vector of observed signals at time t; x t is an K dimensional column vector of unknown source vectors, A is an unknown M Â K observation matrix, H l for 1 l L are unknown K Â K matrices that describe source vector dynamics, w t is an innovation process and v t is an additive noise. Both w t and v t are assumed zero mean gaussian, respectively, with covariance Q and R. The H l matrices, the observation matrix A together with Q and R and x t must be inferred from z t . For added generality, Eq. (1) is presented in a slightly extended form compared to the corresponding model in [31]. Under the latter premises, the log-likelihood of model (1, 2) is given by Þ and vec stands for the column stacking operator [27].
The EM algorithm has long been the favourite tool to solve (1,2) for x t because (3) is sure to converge to at least a local maximum [13,27]. The traditional EM algorithm starts with randomly generated solutions for all parameters and then proceeds by re-iterating its two main steps until the maximum of (3) is attained. It begins with the E-step where the unknown x t are replaced by their expected values given the data and current model parameter estimates. Under gaussian assumptions, the expected system x t are obtained via the Rauch-Tung-Striebel (RTS) smoother [25]. In the second algorithm step, the M-step, one estimates model parameters by maximizing the conditional expected likelihood from the previous E-step. In practice, EM algorithm performance degrades rapidly for highdimensional systems under (1,2). Its solution may even become indeterminate and improper initialization, in fact, often deteriorates estimate quality.
To achieve robust EM solutions, we take into account two common neuroscientist concerns as to what constitute meaningful brain activity components: (a) x t be an economic (i.e. compact/low dimensional) dynamical representation of the brain resting state fMRI dataset as a whole and (b) solutions must be spatially localized, i.e. their associated activation areas mathematically reflect point-spreading functions. We show that the latter assumptions not only allows estimating (1,2) parameters but also x t using the simpler Local Sparse Component Analysis discussed in [32] on z t . The nutshell description of the present algorithm is represented in Fig. 1. The aim is to find initial estimators for the observation matrix and system states which are used to initialize a EM algorithm for maximization of (3).
where U is the M Â M orthonormal matrix, whose rows are the / m 's. With obvious notation, The transform U should be chosen such that a tailored clustering of the rows ofŜ provides the elements that approximate the rows of X. But before this step,Ŝ must be estimated using the sparsity assumption which implies finding a sparse representation ofẐ that captures its intrinsic degrees of freedom.
By considering that s t ¼ Ax t admits a sparse representation lying in B s 1;1 , a particular kind of Besov space [23], approximating z t by s t 2 B s 1;1 ; can be expressed by adding a penalization term to kz t À s t k 2 2 requiring that ks t k s;1 be small, where ks t k s;1 is the B s 1;1 norm of s t . In other words, we want to minimize the following function: whereŝ m;t ¼ hs t ; / m i and k m [ 0 for 1 m M are regularization parameters [12]. For each t, the above function is coercive and strictly convex which means that it has a unique global minimum. If k m ¼ k; the minimum value of (5) is obtained via the soft-thresholding operator [15] Sinceŝ m;t can be zero for some values of t but not for others, the estimator (6) does not ensure sparsity of s t over time even for large k values. To overcome this problem, we propose tyingŝ m;t for 1 t T together and using a recently introduced group-separable regularizer for the functional (5) but in the wavelet domain whereẑ m andŝ m are the m-th rows ofẐ andŜ; respectively. Given k m , solving (7) is achieved by the vector softthresholding operator [7,35] In practice, we still need to estimate k m in (8) for signal denoising. Since U is orthogonal, if R ¼ r 2 I MÂM , then v m $ N 0; r 2 I TÂT ð Þ , wherev m is the m-th row ofV. For very large datasets, this assumption is quite strong but commonly employed in literature. As z t is sparse under U, most of fŝ m;t g 8m must be zero. Provided that fifty percent of fŝ m;t g 8m are zero, the following unbiased estimator for r 2 can be defined whereVAR denotes temporal sample covariance. If VARfŝ m;t g ¼ 0, we have thatẑ m;t are i.i.d normal variables, so implies that an interval with ð1 À aÞ confidence for r 2 is given by Fig. 1 The main algorithm consists of (i) the application of a sparsifying spatial wavelet transformation, resulting into a description in terms of wavelet coefficient time series, (ii) contiguity-constrained clustering of the time series of wavelet coefficients by grouping only nearby coefficients and (iii) estimation of the observation matrix and system states by linear dimensionality reduction of the identified clusters where v 2 n;m is the n-th percentile of the chi-square distribution with m degrees of freedom. Since kẑ m k 2 ¼ ðN À 1ÞVARfẑ m;t g, (11) leads to k m given by with a ¼ 0:05=M.

Contiguity-constrained clustering
The next step consists of determining which time series of wavelet coefficientsŝ m are associated to each spatial component a k x k , where a k is the k-th column of A and x k is the k-th row of X. For this, we use the spatial localization assumption. As the columns of the observation matrix are point-spreading functions, they should be perfectly described by wavelet coefficients forming localized spatial patterns. In this case, each spatial component can be determined using a clustering algorithm enforcing spatial contiguity. One way of achieving this is to apply complete linkage hierarchical clustering with the help of a dissimilarity measure that combines the time series temporal correlation and the physical distance between the wavelet coefficients. In this case, complete linkage hierarchical clustering is attractive because it yields relatively homogeneous clusters, a key property for subsequent accurate reduction of cluster dimensionality.
Clusterization begins with eachŝ m defining a singleton cluster. At each step, it groups a pair ðA; BÞ of clusters under the condition of minimizing the following distance function: where where corðŝ i ;ŝ j Þ denotes the correlation betweenŝ i and s j ; ds defines de center of mass of / i and l i is the scale index of / i in the wavelet decomposition. Accordingly, the above dissimilarity measure combines the absolute value of the correlation coefficient and the physical distance between the wavelet coefficients. Clusterization stops when the minimal distance between the clusters is larger than r (i.e. minfdistðA; BÞ : 8A; 8Bg [ r), for some appropriately chosen r thus leading to a list of cluster memberships that characterize the system's spatial components.
Even though the dissimilarity measure (14) already establishes much of the structure that forms the spatial components of (17), one must decide when to stop clustering by an appropriate value of r. Note that the distðA; BÞ depends solely on the correlation between the wavelet coefficients in A and B. The Fisher z-transform of correlation coefficients, 0:5 log e ð 1þr 1Àr Þ, follows a well-known statistic whose upper limit with an ð1 À a=2Þ % confidence under the null hypothesis of independence is approximately where z ð1Àa=2Þ is the standard normal. Hence, we set the stopping value as for a ¼ 0:05, which interestingly allows estimating the number of spatial components with reference neither to the actual noise level nor to the number of variables, but solely depending on sample size.

Within cluster dimensionality reduction
The next step consists of estimating the observation matrix A and system states of (1, 2) by linear dimensional reduction of each spatial cluster identified in the previous step. After clustering the rows ofŜ, the k-th spatial component a k x k can be approximated by where Y k is an M Â T data matrix, / À1 i is the i-th column of the inverse of UðUT, for wavelet transforms) and I k contains the indexes of the k-th cluster. We assume that the rows of Y k have zero mean, otherwise their mean value can be removed after (17).
According to the approximation model, where E k is an M Â T approximation error matrix, and one must find a k and x k minimizing the approximation error where k Á k F denotes the Frobenius norm. In fact, each spatial component a k x k is a rank-one M Â T matrix given by the first singular value of Y k , i.e.
where r 1 is the largest singular value of Y k , and where u 1 and v 1 are, respectively, the left-singular vector and the right-singular vectors associated to r 1 . With no loss of generality, we consider that the norm of a k equals one leading to a k ¼ u 1 ð21Þ

LDSTM parameter estimation
The remainder of the algorithm consists of applying the traditional EM algorithm for x t estimation [27] using the estimators for x k and a k from previous section to set the initial values for x t and A. Additionally, during the iterative process, A matrix estimation is modified to accommodate linear equality constraints that ensure well-localized a k 's. This is done by solving the following least squares problem:

Numerical illustration
Using simulated data to examine algorithm performance under different conditions, we created a vector time series corresponding to points on a discretized one-dimensional space consisting of M ¼ 256 space points whose activity evolves in over a period of T ¼ 500 points each. The observation matrix that we used (Fig. 2a) consists of the columns of with r 2 accounting for the SNR level defined as SNR ¼ 10 log 10 ðVARðsÞ=r 2 Þ where s ¼ vecð Ax 1 Á Á Á Ax N ½ Þ . The dynamics of the spatial components evolved according to a first-order autoregressive model ðL ¼ 1Þ with  Figure 2b shows the sample variance for a simulated DSTM using the above parameters under SNR ¼ À19db.
We used Daubechies (D2) functions to transform the data and gauged performance by executing 100 Monte Carlo simulations leading to the mean and deviation results as shown in Fig. 3. Algorithm effectiveness was evaluated in terms of how well sources were recovered, as measured by their correlation to the estimated x t , and by how well H l and Q could be estimated as evaluated by computing the connectivity between states using Partial Directed Coherence (PDC) [1].

Simulation results
The mean absolute values of the correlation coefficient between the simulated and estimated sources versus SNR in Fig. 3a show that LDSTM outperforms traditional EM, with very good results for all the three sources even under very unfavourable SNR. Figure 3b shows PDC from x 2 towards x 1 for different SNR levels compared to the corresponding EM estimates. Correct PDC patterns were obtained whose magnitude decreases as SNR decreases but whose overall shape remains.

Real FMRI data
For further illustration purposes, we used fMRI images from seven healthy volunteers under a resting state protocol (approved by the local ethical committee and under individual informed written consent).

LDSTM preprocessing
Motion and slice time correction and temporal high-pass filtering (allowing fluctuations above 0:005 Hz) were carried out using FEAT v5:98. The fMRI data were aligned to the grey matter mask via FreeSurfer's automatic registration tools (v. 5.0.0) resulting in extracted BOLD signals at regions with preponderantly neuronal cell bodies. To further group analysis by temporal concatenation of the participants' fMRIs, individual grey matter images were registered to the 3-mm-thick Montreal Neurological Institute (MNI) template using a 12-parameter affine transform. To generate the spatial wavelet transformation, we used 3D Daubechies (D2) functions up to level 3. The model order for the dynamical component in (1) was defined by the Akaike information criterion.

ICA processing
To compare the LDSTM components with ICA, PICA was performed by multi-session temporal concatenation group ICA (using MELODIC in FSL). Preprocessing included slice time correction, motion correction, skull stripping, spatial smoothing (FWHM equals to 5 mm) and temporal high-pass filtering (allowing fluctuations above 0:005 Hz). The functional images were aligned into the standard space by applying 12 degrees-of-freedom linear affine transformation, and its time series were normalized to have variance of unity. The number of components was fixed at 30 to match the distinct pattern of resting state networks (RSN) usually found by other authors [4,10]. to account for much of the signal energy. In the example, 10 % of the wavelet coefficients explain 80 % of the image energy which is two times more than the 10 % of the most powerful image domain coefficients which represent just 40 %.
The absence of artificial stochastic model constraints permitted exposing the dynamic connectivity between the identified components. Figure 7a summarizes the connectivity network estimated using PDC applied to the reconstructed system components. In addition, PDC also highlights that resting state connectivity is present mainly at low frequencies (Fig. 7b), corroborating several studies of resting state brain connectivity [4].

ICA results
Among the 30 component maps obtained by performing a PICA across all participants, 14 components were considered artifactual components due to scanner and physiological noise. Their signal variances are related to cerebrospinal fluid and white matter, head motion and large vessels. Figure 8 depicts fourteen functional components related to previously report resting state studies. They

Discussion
Local dimension-reduced modelling (LDSTM) as presented here addresses an approach to source estimation and localization in resting state fMRI data analysis that dispenses with artificial stochastic model assumptions, such as those used in classical blind source separation (principal  [3,18,19,21]). In addition to being sparse, the columns of the observation matrix act as point-spreading functions that allow system sources and their observation matrix to be identified via LSCA [32] of the whole fMRI dataset. The cortical components identified by LDSTM (Fig. 5) reflect most of the data variability and coincide with traditional resting state regions observed across different individuals, data acquisition and analysis techniques. They comprise the default mode network ðSC8Þ and brain regions involved in visual ðSC1; SC2; SC5; SC6Þ, motor ðSC13; SC14; SC7Þ and attentional functions ðSC9; SC10; SC17; SC18Þ, indicating that most of the ICA components (Fig. 8) can in fact be decomposed into several local sparse components. However, the present results draw attention to the fact that they were obtained without any additional assumption, such as source independence and/or stationarity. All that was assumed was a k spatial localization, which goes along the line of [11]'s observation that ICA effectiveness for brain FMRI is linked to their ability to handle sparse sources rather than independent ones. This could be explained by pointing out that ICA preprocessing steps involve projecting the data into a reduceddimensional subspace via the singular value decomposition which in turn confines the sources to regions of high signal variance.
PDC analysis shows a network where the information flows from regions in the superior parietal cortex (SPC) to regions in the cerebellum (CER) and anterior cingulate. As expected, the right SPC sends information to the left CER, and left SPC sends information to the right CER. Although the relationship between these structures is known, this stresses two main systems engaging in the mentioned network. The connectivity between SPC and CER is in line with recent studies showing evidence of a cerebellar-parietal network involved in phonological storage [22]. In addition, visual-parietal-cerebellar interactions are expected by following studies of effective connectivity using FMRI [20]. We also observe a network running from the left to right parietal cortex passing through the posterior cingulate. Altogether, we believe that our results provide insight into the mechanisms of how the regions of the fronto-parietal network interact. It also highlights understudied aspects of the cerebellum in this network during resting state.
In our model, LDSTM identified approximately 50 % of components in the cerebellum. This result is surprising as the rate of cerebellar components identified in resting state using ICA is below 20 % in general [4]. Some of these regions seem to be related to noise sources for being located near cerebellar arteries and veins. The components SM1, SM2, SM12, SM17 and SM18 run in the superior surface of the cerebellum near to the superior cerebellar veins, while the components SM8 and SM9 extend into the end of the straight sinus near to internal cerebral veins. On the other hand, the idea that the cerebellum should present as many components as the cortex is encouraging. Many recent fMRI studies have shown that different cerebellar regions are critical for processing higher-order functions in different cognitive domains, in the same way as it occurs in the cortex [30]. In these studies, it is worth mentioning that cerebellar clusters are always smaller than those of corresponding functionality in the cortex. We believe that some differences between ICA and LDSTM may be explained in part by the features along the domain in which they represent the sources. Since spatial wavelet analysis efficiently encodes the data neighbourhood information via a orthogonal transformation, the present method properly addresses a number of issues involving whole-brain connectivity estimation. The first one is associated to the lack of knowledge about the spatial localization of the sources. The method provides a data-driven approach to locate the main sources of data variability, thus avoiding the effects and uncertainties due to a priori region of interest delineation. The second aspect is that the new method naturally employs multi-scale transformations to create a compact model of the images, a feature of growing importance as higher-resolution images are sure to become available in the near future and whose computational processing load may be thereby substantially mitigated. Finally and most importantly, unlike ICA, the method permits deeper connectivity analysis between the identified spatial components as no independence assumption is made 'a priori'.
Various method extensions are possible, especially when it comes to estimate appropriate regularization parameter choice as a function of the amount of noise present in the data. In the present implementation, spatial noise is assumed homogeneous and normally distributed which implies a chi-squared distribution for wavelet coefficient variance. Examination of wavelet coefficients variance for real fMRI data, however, points to the need to consider heavy-tailed distributions, so that a more general approach is currently being developed to estimate wavelet domain noise variance from a finite mixture of exponential distributions that could then be used to quantify the level of data sparsity.

Conclusions
Here, an EM-based algorithm was presented for LDSTM identification. By projecting high-dimensional datasets into smoothness spaces, one can describe the system's spatial components via a reduced number of parameters. Further dimension reduction and denoising is obtained by softvector thresholding under contiguity-constrained hierarchical clustering. Finally, simulated results corroborate that the new algorithm can outperform the traditional EM approach even under mild conditions. Even with very large datasets as in the fMRI example, LDSTM shows promise in its ability to parcelate the human brain into well-localized physiologically plausible regions of spatio-temporal brain activation patterns.