Construction of embedded fMRI resting-state functional connectivity networks using manifold learning

We construct embedded functional connectivity networks (FCN) from benchmark resting-state functional magnetic resonance imaging (rsfMRI) data acquired from patients with schizophrenia and healthy controls based on linear and nonlinear manifold learning algorithms, namely, Multidimensional Scaling, Isometric Feature Mapping, Diffusion Maps, Locally Linear Embedding and kernel PCA. Furthermore, based on key global graph-theoretic properties of the embedded FCN, we compare their classification potential using machine learning. We also assess the performance of two metrics that are widely used for the construction of FCN from fMRI, namely the Euclidean distance and the cross correlation metric. We show that diffusion maps with the cross correlation metric outperform the other combinations.


Introduction
Over the past years, functional magnetic resonance imaging (fMRI) has been widely used for the identification of brain regions that are related to both functional segregation and integration. Regarding functional segregation, the conventional analysis relies on the identification of activated voxels based on functional response models and multivariate statistics between experimental conditions (e.g. resting-state vs. task-stimulated activity). A representative example is the General Linear Model (GLM) that is implemented in well-established software packages such as SPM [1] and FSL [2]. On the other hand, for the assessment of functional integration, there is a distinction between functional and effective connectivity [3]. Functional connectivity (FC) analysis seeks statistical dependencies (e.g. correlations, coherence) between brain regions. Effective connectivity (EC) analysis tries to reveal the influence that one neural system exerts on another [3]. A detailed review of the differences between FC and EC approaches can be found in [3]. Here, we focus on the construction of FCN based on resting-state fMRI (rsfMRI) recordings. In rsfMRI, there are no stimuli, and thus the assessment of functional integration is more complex and less straightforward compared to task-related experiments [4]. Furthermore, spontaneous/resting-state brain activity as measured with fMRI has also been considered as a potential biomarker in psychiatric disorders (see e.g. the review of Zhou et al. 
[5]). In general, for the construction of FCN, two basic frameworks are explored: (a) Seed-Based Analysis (SBA) and (b) Independent Component Analysis (ICA). In SBA [6], the (averaged) fMRI signals of the regions of interest (ROIs) are correlated with each other; correlations above a threshold are considered functional connections between seeds/ROIs. Even though SBA has proved extremely useful in identifying functional networks of specific brain regions [7,8,9], its major disadvantage is the requirement of a-priori knowledge of the functional organization of the brain, while possible correlations between seeds can be due to structured spatial confounds (e.g. scanner artifacts) [10]. Furthermore, seed selection is based on standard coordinates, while at the subject level, anatomical differences may lead to the selection of functionally irrelevant voxels at the group level. Thus, despite the use of normalization techniques, the accuracy of this approach has been shown to be limited [11]. On the other hand, ICA [12] has arisen as an alternative/complementary approach since the early 2000s [13,14,15]. ICA decomposes the 4D fMRI data into a set of spatial components with maximum statistical independence and their associated time series. Smith et al. 
[16], in a meta-analytic study of 30,000 rsfMRI scans, revealed with the aid of ICA a functional "partition" of the brain into resting-state networks (RSNs), such as the Sensorimotor, Default Mode and Auditory networks. Applications of ICA also include data pre-processing, where noise-related components are regressed out from the original fMRI signals [17]. However, while ICA produces spatial components that are statistically independent of each other, there is no clear link between the spatial components and specific brain functions, and furthermore the spatial components cannot in general be ordered by relative importance [10]. Another issue is that most of the standard algorithms that compute ICs utilize gradient-based iterative optimization schemes whose initial guesses are generated randomly, making the whole process stochastic: for the same dataset, the obtained spatial components may differ significantly over repeated runs [18]. Thus, the robustness/reproducibility of the ICA results over repeated runs may be questioned. In order to tackle the above issues, several techniques have been proposed for the classification of ICs and the construction of subject-specific ROIs [19,20]. Advances have also been made regarding the selection of the model order of the ICA decomposition, such as the Bayesian dimensionality estimation technique [13] and the use of information-theoretic criteria for model order selection [21]. Finally, the so-called ranking and averaging ICA by reproducibility (RAICAR) [20,18] (see also [10] for a critical discussion) aims at resolving issues regarding the stochasticity and robustness of the ICA decomposition. RAICAR utilizes a sufficient number of ICA realizations and, based on the reproducibility of the ICs, aims to rank them in terms of the most "reliable" components. Reliable ICs among realizations are assessed via correlations, and the final estimate of each component is obtained by averaging. Alternatively and/or 
complementary to the above analysis, linear manifold learning algorithms such as Principal Component Analysis (PCA) [22,23,24] and Multidimensional Scaling (MDS) [25,26] have been exploited. PCA has been successfully applied in the pre-processing routine for dimensionality reduction (often prior to ICA) [27]. Applications of PCA also include the recovery of signals of interest [28] and the construction of FCN from fMRI scans in task-related experiments [23,24]. In these studies, the performance of PCA with respect to the detection of regions of correlated voxels has been shown to be satisfactory but not without problems. For example, a study by Baumgartner et al. [24] highlighted the limits of PCA in correctly identifying the activation of brain regions in cases of low contrast-to-noise ratios (CNR), appearing when signal sources of e.g. physiological noise are present [29]. MDS has also been widely used in fMRI (mostly in task-based studies), mainly for the identification of similarities between brain regions in terms of voxel-wise connectivity [30,31,32,33,34,35]. The implementation of MDS in neuroimaging dates back to the work of Friston et al. [26], which investigated the embedded (voxel-wise) connectivity of fMRI data acquired during word-generation tasks from healthy and schizophrenia subjects. Salvador et al. [36] used MDS to investigate the embedded connectivity of anatomical regions of the brain from rsfMRI data. Benjaminsson et al. [37] used MDS to embed high-dimensional rsfMRI data from the mutual information space into a low-dimensional Euclidean space for the identification of RSNs. Herve et al. [38] used MDS to acquire a low-dimensional approximation of interregional correlations for the investigation of affective speech comprehension. Finally, in a meta-analytic study by Etkin et al. 
[39], MDS was exploited to provide a low-dimensional visualization of the co-activation interrelations of ROIs. MDS has also been used in works investigating the functional (dys)connectivity associated with schizophrenia [40] and Asperger's syndrome [41]. However, thus far, only a few studies have exploited non-linear manifold learning algorithms such as Locally Linear Embedding (LLE) [42], Isometric Feature Mapping (ISOMAP) [43] and Diffusion Maps [44] for the analysis of fMRI data and particularly for the construction of FCN. LLE has been applied in rsfMRI studies for the improvement of predictions in ageing studies [45], for low-dimensional clustering towards the classification of healthy subjects and patients with schizophrenia [46], and as an alternative method for dimensionality reduction before the application of ICA in task-related fMRI, where non-linear relationships in the BOLD signal are introduced [47]. In Anderson et al. [48], ISOMAP was applied to a benchmark rsfMRI dataset of 146 subjects for the construction of embedded low-dimensional FCN for the classification between controls and schizophrenia patients. ROIs were selected using single-subject ICA and the similarities between the ICA components were assessed using a pseudo-distance measure based on lagged cross-correlation. Graph-theoretical measures were then used for the classification between patients and healthy controls. Another study based on single-subject ICA exploited ISOMAP to classify spatially unaligned fMRI scans [49]. That study considered patients with schizophrenia versus healthy controls, as well as different age groups (young/old) of healthy controls versus patients with Alzheimer's disease. Despite the relatively low sample sizes, results were promising, with good classification rates. Recently, Haak et al. 
[50] utilized ISOMAP in an effort to create individualised connectopies from rsfMRI recordings taken from the WU-Minn Human Connectome Project in a fully data-driven manner. So far, only a handful of studies have used Diffusion Maps for the analysis of fMRI data, focused on the clustering of spatial maps in task-related experiments [51,52]. Shen et al. [51] employed Diffusion Maps to separate activated from non-activated voxels. Sipola et al. [52] used Diffusion Maps with a Gaussian kernel to cluster selected fMRI spatial maps derived by ICA. They demonstrated their approach using fMRI recordings acquired from healthy participants listening to a stimulus with a rich musical structure. Other applications of Diffusion Maps in neuroimaging concern epileptic-seizure prediction and the identification of the pre-seizure state in EEG time series [53,54]. A review of the intersection between manifold learning methods and the construction of FCN can be found in [55] and [56].
Here, we used MDS, ISOMAP and Diffusion Maps to construct embedded FCN from single-subject ICA analysis of rsfMRI data taken from healthy controls and schizophrenia patients. For our demonstrations, we used the COBRE rsfMRI data, which are publicly available and have been used recently in many studies [57,58,48,59]. Based on key global graph-theoretical measures of the embedded graphs, we assessed their classification efficiency using several machine learning algorithms, namely linear Support Vector Machines (LSVM), Support Vector Machines with a radial basis function kernel (RSVM), k-Nearest Neighbours (k-NN), and Artificial Neural Networks (ANN). We also investigated the performance of the two most commonly used metrics, namely the cross-correlation and the Euclidean distance. Our analysis showed that Diffusion Maps outperformed all other methods, and that the lagged cross-correlation proved to be a better choice than the Euclidean metric. At this point, we should note that our study does not aim at the best classification performance by trying to find the best possible pre-processing pipeline of the raw fMRI data and/or the selection of subjects and/or the selection of the best set of graph-theoretical measures that provide the maximum classification. Rather, we aim at using state-of-the-art manifold learning methods for the construction of the FCN, comparing their classification efficiency using only a few key global graph-theoretical measures, and comparing the obtained results with those derived by similar studies (see e.g. [48]) using the same pipeline for data pre-processing and single-subject ICA. To the best of our knowledge, this paper is the first to perform such a thorough comparative analysis of both linear and nonlinear manifold learning on rsfMRI data. It is also the first study to show how Diffusion Maps can be used for the construction of FCN from rsfMRI, assessing also the efficiency of two basic distance metrics, the cross-correlation-based and the Euclidean distance.

Pre-processing and signal extraction
For our demonstrations, we performed a basic pre-processing of the raw fMRI data, as also implemented in other studies (see e.g. [48]). In particular, for the fMRI data processing we used FEAT (FMRI Expert Analysis Tool) Version 6.00, part of FSL (FMRIB's Software Library, www.fmrib.ox.ac.uk/fsl). The following pre-statistics processing was applied: motion correction using MCFLIRT [60]; slice-timing correction using Fourier-space time-series phase-shifting; non-brain removal using BET [61]; spatial smoothing using a Gaussian kernel of FWHM 5 mm; and grand-mean intensity normalization of the entire 4D dataset by a single multiplicative factor. ICA-AROMA [17] was also implemented to detect and factor out noise-related independent components (ICs), along with high-pass temporal filtering at 0.01 Hz (100 seconds) that was applied after the ICA-AROMA procedure, as is highly recommended [17]. We then proceeded with the decomposition of the pre-processed fMRI data into spatial ICs (for each subject) using the RAICAR methodology [20]. In this way, we found the most reproducible spatial ICs over repeated runs as a solution to the well-known problem of the variability of the ICA decomposition [18]. This choice was related to the benchmark fMRI data per se, as there was only a single session per subject with a relatively short duration (6 minutes); therefore, we would not expect a robust ICA decomposition for all subjects (see also the discussion in [10]). Another choice would be to perform group-ICA analysis, but we decided to use single-subject ICA in order to have a common ground with the methodologically similar work presented in [48]. Group-ICA analysis will be performed in a future study.

Independent Component Analysis (ICA)
ICA is a linear data-driven technique that reduces the high-dimensional fMRI signal F(t, x, y, z) to a set of M statistically independent spatial components (ICs). This reduction can be represented as:

F(t, x, y, z) = Σ_{i=1}^{M} A_i(t) C_i(x, y, z),

where F(t, x, y, z) is the measured BOLD signal, A_i(t) is the temporal amplitude of the i-th ICA component (the matrix A containing all temporal amplitudes is known as the mixing matrix) and C_i(x, y, z) is its spatial magnitude.
While PCA requires that the principal components are uncorrelated and orthogonal, ICA asks for statistical independence between the ICs. Generally, ICA algorithms are based either on the minimization of mutual information or the maximization of non-Gaussianity among components. As discussed in the introduction, most of the standard implementations of ICA, such as the one in MELODIC (Multivariate Exploratory Linear Optimized Decomposition into Independent Components) [62], which is part of the FSL fMRI analysis software package, share similar gradient-based iterative optimization algorithms whose initial values are generated randomly, thus making the whole process stochastic. As a consequence, results over repeated runs may differ significantly [18]. A solution to this problem is provided by the so-called ranking and averaging ICA by reproducibility (RAICAR) [20], which we briefly describe in the following section.
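To illustrate the decomposition above, here is a minimal sketch in Python using scikit-learn's FastICA on synthetic signals. The sources S, mixing matrix A and dimensions are made up for the example (the paper itself worked with FSL/MELODIC and RAICAR); it only shows that ICA recovers statistically independent components whose mixture reconstructs the observations:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
T, M = 150, 3                       # time points and number of components
t = np.linspace(0, 8, T)
# three hypothetical independent sources (stand-ins for spatial ICs)
S = np.c_[np.sin(2 * t), np.sign(np.cos(3 * t)), rng.uniform(-1, 1, T)]
A = rng.normal(size=(M, M))         # mixing matrix (temporal amplitudes)
X = S @ A.T                         # observed mixtures

ica = FastICA(n_components=M, random_state=0)
S_est = ica.fit_transform(X)        # estimated independent components
A_est = ica.mixing_                 # estimated mixing matrix

# the model reconstructs the observations (components are recovered
# only up to permutation and scaling)
X_rec = S_est @ A_est.T + ica.mean_
```

Note that the random initialization mentioned in the text corresponds to the `random_state` argument here; without fixing it, repeated runs can return differently ordered and signed components.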

Ranking and averaging ICA by reproducibility (RAICAR)
The RAICAR methodology developed by Yang et al. [20] was designed to tackle the problem of IC variability by performing K ICA realizations. Thus, RAICAR leads to K "slightly" different mixing matrices A_1, A_2, ..., A_K and K different sets of ICs S_1, S_2, ..., S_K. Each realization finds a fixed number M of spatial ICs. Then, a cross-realization correlation matrix (CRCM) of size MK × MK is constructed, and the alignment of the ICs of each realization (ICA produces unaligned components) takes place on the basis of the maximum absolute cross-correlation among components. The CRCM is composed of submatrices R_{i,j}, i, j = 1, 2, ..., K, each of size M × M, whose elements are the absolute spatial cross-correlation coefficients among components across realizations i and j. The CRCM is a symmetric matrix and its diagonal consists of identity matrices, which are ignored in the next steps of the algorithm. The procedure starts with the identification of the global maximum of the CRCM, thus finding a matched component based on two realizations. At the next step, the methodology seeks the highest absolute spatial cross-correlation coefficients of the identified component in the remaining realizations, factoring out all others. The procedure is repeated M times until M aligned components are found. The next step involves the computation of a reproducibility index for each of the aligned components. This is done by constructing the histogram of the absolute spatial cross-correlation coefficients of the upper triangle of the CRCM. This histogram tends to be bimodal since, in general, we expect low spatial cross-correlation among most of the ICs and high spatial cross-correlation only for a few of them. A spatial cross-correlation threshold is applied, with the desired value lying in the valley of the histogram between the two modes [20]. Finally, the reproducibility index is computed for each one of the aligned components. This is done by 
aggregating the supra-threshold absolute cross-correlation coefficients of the CRCM for each of the aligned components. The last step of the algorithm is the ranking and averaging of the aligned components in descending order based on the reproducibility index. Selective averaging is applied so that the components are averaged if, and only if, the given aligned component has at least one absolute spatial cross-correlation coefficient above the threshold across realizations.
After applying RAICAR, the ICs are chosen via a cut-off threshold on the reproducibility index (of each component), which indicates how consistent the appearance of an IC is across realizations. Here, we set K = 30 realizations (as also in [20]); taking more realizations did not change the outcomes of the analysis. The spatial cross-correlation threshold was chosen by localizing the minimum of the histogram of the absolute cross-correlation coefficients of the CRCM. This threshold was specified separately for each subject. The reproducible ICs were determined from the histogram of the reproducibility index over the number of components in descending order. The cut-off threshold was set as half of the maximum possible reproducibility index value, i.e. K(K−1)/2 · 0.5 (this choice is the same as the one used in [20]). This cut-off threshold was set equal for all subjects. Subjects with fewer than 20 reproducible ICs were excluded from further analysis, as this number of components resulted in disconnected graphs. Thus, we ended up with 104 subjects, of which 57 were healthy controls and 47 schizophrenia patients.
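The alignment-and-reproducibility idea described above can be sketched as follows. This is a deliberately simplified Python analogue, not the full RAICAR implementation: the greedy matching seeds on the global CRCM maximum, collects the best match per realization, and aggregates the supra-threshold coefficients of each aligned group; the synthetic components are hypothetical:

```python
import numpy as np

def raicar_align(realizations, threshold=0.5):
    """Simplified RAICAR-style alignment and reproducibility ranking.

    realizations: list of K arrays, each M x V (M components, V voxels).
    Returns aligned component-index groups (one index per realization)
    ranked by a reproducibility index."""
    K, M = len(realizations), realizations[0].shape[0]
    # z-score components so that dot products give spatial correlations
    Z = np.vstack([(r - r.mean(axis=1, keepdims=True)) / r.std(axis=1, keepdims=True)
                   for r in realizations])              # (K*M) x V
    C = np.abs(Z @ Z.T) / Z.shape[1]                    # CRCM, (K*M) x (K*M)
    for k in range(K):                                  # ignore within-run blocks
        C[k * M:(k + 1) * M, k * M:(k + 1) * M] = 0.0
    aligned, repro = [], []
    for _ in range(M):
        i, j = np.unravel_index(np.argmax(C), C.shape)  # global maximum pair
        group = {i, j}
        for k in range(K):                              # best match in each run
            block = C[i, k * M:(k + 1) * M]
            if block.max() > 0:
                group.add(k * M + int(block.argmax()))
        group = sorted(group)
        sub = C[np.ix_(group, group)]
        # reproducibility: supra-threshold correlation mass of the group
        repro.append(float(sub[sub > threshold].sum() / 2.0))
        aligned.append(group)
        C[group, :] = 0.0                               # factor the group out
        C[:, group] = 0.0
    order = np.argsort(repro)[::-1]                     # rank in descending order
    return [aligned[o] for o in order], sorted(repro, reverse=True)
```

For near-identical realizations, each aligned group attains a reproducibility index close to the theoretical maximum K(K−1)/2 mentioned in the text.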

Construction of Functional Connectivity Networks
For the construction of FCN, we used all combinations of three manifold learning algorithms, namely MDS, ISOMAP and Diffusion Maps, and two widely used metrics, namely the lagged cross-correlation [48,63,64] and the Euclidean distance [52,65,66].

Construction of FCN based on the lagged cross-correlation
For every pair of the associated time courses of the ICs, say A_i and A_j, the cross-correlation function (CCF) over a maximum of three time lags (as in [48]) reads:

CCF_{ij}(l) = Σ_t (A_i(t + l) − Ā_i)(A_j(t) − Ā_j) / sqrt( Σ_t (A_i(t) − Ā_i)² · Σ_t (A_j(t) − Ā_j)² ),

where l is the time lag and Ā_i is the mean value of the whole time series A_i.
For the construction of the FCN connectivity/correlation matrices, we used a pseudo-distance measure d_c defined as (see also [48]):

d_c(A_i, A_j) = sqrt( 2 (1 − max_l |CCF_{ij}(l)|) ).

The resulting (dis)similarity matrices are fully connected and therefore hardly comparable between subjects (see the discussion in [48]). Thus, here, as a standard practice (and in all other algorithms described below), we applied thresholding to the (dis)similarity matrices in order to keep the strongest connections of the derived functional connectivity matrix. In order to factor out the influence of a variable network density on the computation and comparison of graph-theoretical measures across groups [67], we implemented proportional thresholding (PT) [67]. In particular, we considered a range of PT levels from 20% to 70% with a step of 2%. Below the threshold of 20%, graphs became too fragmented, while thresholds above 70% resulted in dense graphs that mostly included noisy and less significant edges (see also the discussion in [68] on this issue). If a graph was still fragmented after thresholding, its giant component was used for further analysis.
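A compact Python sketch of this step, assuming the common pseudo-distance transform d_c = sqrt(2(1 − max_l |CCF(l)|)) of the maximum absolute lagged correlation (one of several possible choices), followed by proportional thresholding:

```python
import numpy as np

def max_lagged_corr(a, b, max_lag=3):
    """Maximum absolute Pearson correlation over lags -max_lag..max_lag."""
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        x = a[max(lag, 0): len(a) + min(lag, 0)]
        y = b[max(-lag, 0): len(b) + min(-lag, 0)]
        best = max(best, abs(np.corrcoef(x, y)[0, 1]))
    return best

def cc_distance_matrix(A, max_lag=3):
    """Pseudo-distance matrix d_c between IC time courses (rows of A)."""
    M = A.shape[0]
    D = np.zeros((M, M))
    for i in range(M):
        for j in range(i + 1, M):
            c = max_lagged_corr(A[i], A[j], max_lag)
            D[i, j] = D[j, i] = np.sqrt(2.0 * (1.0 - c))
    return D

def proportional_threshold(D, density):
    """Binary graph keeping the `density` fraction of strongest edges
    (i.e. the smallest distances)."""
    M = D.shape[0]
    iu = np.triu_indices(M, k=1)
    d = D[iu]
    n_keep = int(round(density * d.size))
    keep = d <= np.sort(d)[n_keep - 1]
    G = np.zeros_like(D, dtype=int)
    G[iu[0][keep], iu[1][keep]] = 1
    return G | G.T
```

Time courses that are identical up to a lag within the window get a distance near zero; the proportional threshold then retains a fixed edge density regardless of the overall correlation level, as motivated in [67].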

Construction of FCN based on the euclidean distance
The Euclidean distance is used in many studies to assess (dis)similarities between fMRI data points [52,65,66]. For the time series associated with the independent spatial maps, A_i and A_j, the Euclidean distance reads:

d_E(A_i, A_j) = || A_i − A_j ||_2 = sqrt( Σ_t (A_i(t) − A_j(t))² ).

For the construction of FCN, PT was applied to the Euclidean similarity matrices of each individual over the range 20%-70%.
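The Euclidean distance matrix between IC time courses is a one-liner in Python; the 20 × 150 array here is a hypothetical stand-in for a subject's IC temporal amplitudes:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 150))          # 20 hypothetical IC time courses

# pairwise Euclidean distances between the rows of A
diff = A[:, None, :] - A[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))   # 20 x 20, symmetric, zero diagonal
```

The same proportional-thresholding step shown earlier can then be applied to D.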

Construction of FCN with manifold learning algorithms
Below we present how MDS, ISOMAP and Diffusion Maps can be exploited to construct (embedded) FCN.

Construction of FCN with MDS
Multidimensional Scaling [25] is a linear dimensionality reduction method that can be used to find similarities between pairs of objects in a low-dimensional (embedded) space. Given a set of M objects/observables x_1, x_2, ..., x_M ∈ R^N, MDS produces a low-dimensional data representation y_1, y_2, ..., y_M ∈ R^p, p ≪ N, minimizing the objective function:

min_Y Σ_{i,j, i≠j} ( d(x_i, x_j) − ||y_i − y_j|| )²,

where d(x_i, x_j) is the (dis)similarity obtained (by any (dis)similarity measure of choice) between all pairs of points x_1, x_2, ..., x_M ∈ R^N. In our case, the observables x_i are the temporal amplitudes of the spatial ICs, A_i ∈ R^N, i = 1, ..., M. Here, N = 150 (number of time points).
The coordinates of the embedded manifold y_1, y_2, ..., y_M are given by:

Y = Λ_{p×p} V^T_{p×M},

where Λ_{p×p} contains the square roots of the p largest eigenvalues and V^T_{p×M} the corresponding eigenvectors of the matrix:

B = −(1/2) H D_X H,

with D_X the matrix of squared (dis)similarities between the original data points, and H_{M×M} the double centering matrix defined as:

H = I_M − (1/M) 1 1^T,

where 1 is the M-dimensional vector of ones. Using MDS, the original data are thus reduced to a p-dimensional representation. Here, for the construction of the embedded FCN, we produced distance matrices D_Y of size M × M between the embedded points (using either the lagged cross-correlation or the Euclidean distance).
For the implementation of the MDS algorithm, we used the "cmdscale" function contained in the "stats" package of the R free software environment [69].
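The classical MDS steps above (double centering, eigendecomposition, top-p coordinates) can be sketched in a few lines of Python; this is a minimal analogue of R's "cmdscale", not the package itself:

```python
import numpy as np

def classical_mds(D, p=2):
    """Classical MDS: embed an M x M (dis)similarity matrix D into R^p.

    B = -1/2 * H (D**2) H, with H the double-centering matrix; the
    embedding uses the square roots of the top-p eigenvalues of B."""
    M = D.shape[0]
    H = np.eye(M) - np.ones((M, M)) / M        # double-centering matrix
    B = -0.5 * H @ (D ** 2) @ H
    w, V = np.linalg.eigh(B)                   # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:p]              # p largest eigenpairs
    L = np.sqrt(np.maximum(w[idx], 0.0))       # clip tiny negatives from noise
    return V[:, idx] * L                       # M x p embedded coordinates
```

For an exact Euclidean distance matrix, the embedding reproduces the original configuration up to rotation and reflection, so all pairwise distances are preserved.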

Construction of FCN using ISOMAP
ISOMAP is a non-linear manifold learning algorithm that, given a set of M objects/observables x_1, x_2, ..., x_M ∈ R^N, produces a low-dimensional data representation y_1, y_2, ..., y_M ∈ R^p, p ≪ N, minimizing the objective function:

min_Y Σ_{i,j, i≠j} ( d_G(x_i, x_j) − ||y_i − y_j|| )²,

where d_G(x_i, x_j) is the shortest path (geodesic distance) and d(x_i, x_j) is the (dis)similarity obtained (by any (dis)similarity measure of choice) between all pairs of points x_1, x_2, ..., x_M ∈ R^N. In our case, the observables x_i are the temporal amplitudes of the spatial ICs, A_i ∈ R^N, i = 1, ..., M.
The above minimization problem is solved as follows [43]:
• Construct a graph G = (V, E), where the vertices V are the ICs A_i; the links E are created by using either the k-nearest neighbours algorithm or a fixed distance between nodes, known as the ε-distance. In the latter case, a link between two ICs is created if d(A_i, A_j) ≤ ε. Here, we used the k-nearest neighbours algorithm with k = 3, 4, 5, 6 [43]. Set the weight w_{i,j} of the link (if any) between A_i and A_j as w_{i,j} = 1/d(A_i, A_j); if there is no link, set w_{i,j} = 0.
• Approximate the embedded manifold by estimating the shortest path (geodesic distance) d_G(A_i, A_j) for each pair of nodes based on the distances d_{i,j}; this step can be implemented, for example, using Dijkstra's algorithm [70]. This procedure results in a matrix D_G whose elements are the shortest paths d_G(A_i, A_j).
• Estimate the coordinates of the low-dimensional (embedded) manifold y_1, y_2, ..., y_M by applying the MDS algorithm [25] to the geodesic distance matrix D_G.
Here, for the implementation of the ISOMAP algorithm, we used the package "vegan" [71] contained in the R free Software Environment [69].
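The three steps can be sketched in Python as follows (a minimal analogue of the "vegan" implementation, not the package itself; for simplicity the graph here is weighted by the distances themselves so that Dijkstra's shortest paths approximate geodesic lengths directly):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap_embed(D, k=5, p=2):
    """ISOMAP sketch: kNN graph on a precomputed distance matrix,
    geodesic distances via Dijkstra, then classical MDS."""
    M = D.shape[0]
    W = np.full((M, M), np.inf)                # inf = no edge
    for i in range(M):
        nn = np.argsort(D[i])[1:k + 1]         # k nearest neighbours (skip self)
        W[i, nn] = D[i, nn]
    W = np.minimum(W, W.T)                     # symmetrize the graph
    DG = shortest_path(W, method="D", directed=False)   # geodesic distances
    # classical MDS on the geodesic distance matrix
    H = np.eye(M) - np.ones((M, M)) / M
    B = -0.5 * H @ (DG ** 2) @ H
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:p]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```

On data lying along a curved one-dimensional manifold, the first embedded coordinate recovers the ordering along the curve, which a purely linear method would distort.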

Construction of FCN using Diffusion Maps
Diffusion Maps [44] is a non-linear manifold learning algorithm that, given a set of M objects/observables X = {x_1, x_2, ..., x_M} ∈ R^N, produces a low-dimensional representation Y = {y_1, y_2, ..., y_M} ∈ R^p, p ≪ N, adopting the diffusion distance among data points as the preserved metric [72]. The embedding of the data in the low-dimensional space is obtained from the projections onto the eigenvectors of a normalized graph Laplacian [73]. In a nutshell, the Diffusion Maps algorithm can be described by the following steps:
• Construction of the affinity matrix W_{M×M}; here, M is the number of ICs for each subject. The elements W_{ij} represent the weighted edges connecting nodes i and j using the so-called heat kernel:

W_{ij} = exp( −d(x_i, x_j)² / σ ),

where x_i is an N-dimensional point (here, N = 150), d(x_i, x_j) are the (dis)similarities obtained (by any dissimilarity measure of choice) between all pairs of points x_1, x_2, ..., x_M ∈ R^N, and σ is an appropriately chosen parameter that can be physically described as a scale parameter of the heat kernel [44]. The heat kernel W satisfies two important properties: symmetry and positive semi-definiteness. The latter property is crucial, as it allows the interpretation of the weights as scaled probabilities of "jumping" from one node to another. The neighbourhood-size parameter σ is data-dependent and here it was determined by finding the linear region of the sum of all weights in W, say S_w, over different values of σ [74,52]. S_w is calculated through the formula:

S_w = Σ_{i=1}^{M} Σ_{j=1}^{M} W_{ij}.

In order to use a single value of σ for all participants, we computed a super-distribution of the sum of weights across subjects (taking the median value of the distributions) over different values of σ. Thus, we considered values of σ lying in the linear region of the super-distribution. After inspection, every such value of σ was also lying in the linear region of every single subject's logarithmic plot.
• Formulation of the diagonal M × M normalization matrix K, with K_{ii} = Σ_j W_{ij}, along with the diffusion matrix:

P = K^{−1} W.

Each element of the normalized diffusion matrix P reflects the connectivity between two data points x_i and x_j. As an analogy, this connectivity can be seen as the probability of "jumping" from one point to another in a random walk process. Consequently, raising P to a power t can be thought of as a diffusion process. As t increases, paths of low probability tend to zero, while the connectivity between paths of high probability remains high enough, thus governing the diffusion process [44]. Thus, the Diffusion Maps algorithm preserves the diffusion distance among points in a low-dimensional Euclidean space. The diffusion distance is closely related to the diffusion matrix P and, for two distinct points x_i, x_j and a specific time instance t, is defined as [75]:

D_t(x_i, x_j)² = Σ_u ( p_t(x_i, u) − p_t(x_j, u) )² / φ_0(u),

where p_t(x_i, ·) denotes the i-th row of P^t and φ_0 is the stationary distribution of the random walk. Unlike the geodesic distance, the diffusion distance is robust to noise perturbations, as it sums over all possible paths (of t steps) between points [44].
• Construction of the conjugate matrix:

P̃ = K^{1/2} P K^{−1/2};

substituting P = K^{−1} W, we get:

P̃ = K^{−1/2} W K^{−1/2}.

This is the so-called normalized graph Laplacian matrix [73]. The matrix P is adjoint to the symmetric matrix P̃; thus, P and P̃ share the same eigenvalues [76].
• Singular Value Decomposition (SVD) of P̃ yields:

P̃ = U Λ U^T,

where Λ is a diagonal matrix containing the M eigenvalues of P̃ and U contains the eigenvectors of P̃. The eigenvectors V of P can be now found by [76]:

V = K^{−1/2} U.

• By leaving out the trivial eigenvalue λ = 1 of the matrix Λ and the corresponding eigenvector contained in V, the coordinates of the low-dimensional embedded manifold y_1, y_2, ..., y_M are given by:

Y = Λ_{p×p} V^T_{p×M},

where Λ_{p×p} contains the p largest (non-trivial) eigenvalues and V^T_{p×M} the corresponding eigenvectors of the diffusion matrix P.
For the implementation of the above algorithm we used the package "diffusionMap" [77] contained in the R free Software Environment [69].
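The steps above can be condensed into a short Python sketch (an analogue of the "diffusionMap" package, not the package itself; the exp(−d²/σ) kernel form follows the scale-parameter description in the text and is an assumption about the exact normalization):

```python
import numpy as np

def diffusion_maps(D, sigma, p=2, t=1):
    """Diffusion Maps sketch on a precomputed distance matrix D.

    Heat-kernel affinities -> Markov matrix P = K^{-1} W -> eigen-
    decomposition of the symmetric conjugate K^{-1/2} W K^{-1/2} ->
    p non-trivial diffusion coordinates lambda_k^t * psi_k."""
    W = np.exp(-D ** 2 / sigma)                  # heat-kernel affinities
    deg = W.sum(axis=1)                          # diagonal of the matrix K
    P_sym = W / np.sqrt(np.outer(deg, deg))      # K^{-1/2} W K^{-1/2}
    w, U = np.linalg.eigh(P_sym)
    idx = np.argsort(w)[::-1]                    # eigenvalues in decreasing order
    w, U = w[idx], U[:, idx]
    V = U / np.sqrt(deg)[:, None]                # right eigenvectors of P
    return (w[1:p + 1] ** t) * V[:, 1:p + 1]     # skip the trivial lambda_0 = 1

def sum_of_weights(D, sigmas):
    """S_w(sigma), used to locate the linear region for choosing sigma."""
    return np.array([np.exp(-D ** 2 / s).sum() for s in sigmas])
```

As a sanity check, for two well-separated clusters the first non-trivial diffusion coordinate is approximately piecewise constant and separates the two groups, which is the behaviour the random-walk interpretation predicts.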

Choice of the embedded dimension
Here, the embedded dimension was determined via the eigenspectrum of the final decomposition of each dimensionality reduction/manifold learning algorithm. If there is a gap between the first few largest eigenvalues of the final decomposition and the rest of the eigenspectrum, these few eigenmodes capture most of the distance differences between data points and are able to represent and uncover intrinsic properties of the data structure [76,78,79]. In order to determine the embedded dimension for the methods described above, we considered the following steps. We sorted the eigenvalues in decreasing order, i.e. λ_1 ≥ λ_2 ≥ ... ≥ λ_M. Then, for each subject, we calculated the consecutive pairwise differences λ_1 − λ_2, λ_2 − λ_3, ..., λ_{M−1} − λ_M. A gap in the mean (over all subjects) differences determines the point beyond which the relative contribution of a further eigendimension is redundant (i.e. it has only a small contribution to the reconstruction of the embedded FCN). Thus, for our computations, we considered low-dimensional embeddings of p = 2, 3, 4, 5 dimensions for each one of the manifold learning methods described above.
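The eigengap rule described above amounts to a few lines of Python (the example spectrum is made up for illustration):

```python
import numpy as np

def embedding_dim_by_gap(eigvals):
    """Choose the embedded dimension p at the largest gap between
    consecutive eigenvalues (sorted in decreasing order)."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    diffs = lam[:-1] - lam[1:]          # lambda_k - lambda_{k+1}
    return int(np.argmax(diffs)) + 1    # keep the eigenmodes before the gap

# e.g. a spectrum with a clear gap after the third eigenvalue
p = embedding_dim_by_gap([1.0, 0.9, 0.85, 0.2, 0.19, 0.18])
```

In the study, the differences are first averaged over subjects before the gap is located, so that a single p is used per method.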

Graph-theoretical measures
Here, we analyzed the topological properties of the binary FCN graphs on the basis of three major graph measures, namely the average path length, the global clustering coefficient and the median degree, which are among the most frequently used in neuroscience [80]. In particular, given a graph G = (V, E) with g_{ij} representing the link (0: unconnected, 1: connected) from node i to node j, and k_i = Σ_{j=1}^{N_V} g_{ij} the degree of node i, the graph measures are computed as follows. (a) The average path length is defined by L = (1 / (N_V (N_V − 1))) Σ_{i≠j} D_{G,ij}, i.e. the average number of steps along the shortest paths D_{G,ij} over all possible pairs of network nodes. This is a measure of the efficiency of information or mass transport on a network between all possible pairs of nodes. (b) The global clustering coefficient is defined by C_g = t_c / t, where t is the number of triplets and t_c is the number of closed triplets. A triplet of a graph consists of three nodes that are connected by either two (open triplet) or three (closed triplet) ties. In general, this measure indicates how random or structured a graph is (in our case, in terms of functional segregation). (c) The median degree M_k is the median value of the degree distribution of G. This measure reflects how well connected the "median" network node is in terms of the number of links incident to it. An extensive review of the definitions and meaning of the above and other graph-theoretical measures with respect to brain functional networks can be found in [81,80]. The computations for the graph analysis were performed with the "igraph" package [82] contained in the free software environment R [69].
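The three measures can be computed in a few lines; this is a Python/NetworkX analogue of the R "igraph" calls (NetworkX's transitivity is exactly the closed-triplets/all-triplets ratio defined above):

```python
import numpy as np
import networkx as nx

def global_graph_measures(G):
    """Average path length L, global clustering coefficient C_g and
    median degree M_k of a binary graph; if the graph is fragmented,
    the giant component is used, as described in the text."""
    if not nx.is_connected(G):
        G = G.subgraph(max(nx.connected_components(G), key=len)).copy()
    L = nx.average_shortest_path_length(G)
    C_g = nx.transitivity(G)            # closed triplets / all triplets
    M_k = float(np.median([d for _, d in G.degree()]))
    return L, C_g, M_k
```

For instance, a complete graph gives L = 1, C_g = 1 and M_k = N_V − 1, while a ring gives C_g = 0 and M_k = 2.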

Classification Algorithms
Based on the three key graph measures, the classification was assessed using machine learning algorithms, namely Linear Support Vector Machines (LSVM), Radial Support Vector Machines (RSVM), Artificial Neural Networks (ANN) and k-NN classification (for a brief description of these algorithms and their parameters, see the Appendix). The classification algorithms were trained, validated and tested using a 10-fold cross-validation scheme that was repeated 100 times. Thus, we separated the data into ten separate sub-samples; nine of them were used as training sets and one was used for validation. This process was repeated 10 times, each time leaving out a different sub-sample to serve as the validation set. The whole procedure was repeated 100 times. The overall classification rate was determined by computing the average classification rate over all repetitions of the 10-fold cross-validation for each model. The average confusion matrix (over all repetitions of the 10-fold cross-validation) was also computed for each classification model. The confusion matrix is a 2 × 2 (in the case of binary classification) square matrix containing all true positives TP, false positives FP, true negatives TN and false negatives FN. Here, we considered the schizophrenia cases as positives P and the healthy control cases as negatives N. Sensitivity (also called the true positive rate) and specificity (also called the true negative rate) are basic statistical measures for the assessment of binary classifications. The sensitivity is given by TPR = TP / (TP + FN), while the specificity is given by TNR = TN / (TN + FP). Here, sensitivity characterizes the ability of the classifier to correctly identify a schizophrenia patient, while specificity is the ability of the classifier to correctly identify a healthy subject. We used the algorithms contained in the "caret" package [83] of the R free software environment [69].
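The repeated stratified cross-validation with an aggregated confusion matrix can be sketched as follows. This is a Python/scikit-learn analogue of the "caret" workflow with an RBF-kernel SVM; the feature table is synthetic (104 subjects × 3 graph measures with an artificial group effect), and only 10 repeats are run here to keep the sketch fast, versus the 100 used in the study:

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
# hypothetical feature table: 104 subjects x 3 graph measures (L, C_g, M_k)
y = np.array([1] * 47 + [0] * 57)            # 1 = schizophrenia, 0 = control
X = rng.normal(size=(104, 3)) + y[:, None]   # synthetic group effect

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)

tp = fp = tn = fn = 0                        # aggregated confusion matrix
for train, test in cv.split(X, y):
    pred = clf.fit(X[train], y[train]).predict(X[test])
    cm = confusion_matrix(y[test], pred, labels=[0, 1])
    tn += cm[0, 0]; fp += cm[0, 1]; fn += cm[1, 0]; tp += cm[1, 1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)                 # true positive rate (TPR)
specificity = tn / (tn + fp)                 # true negative rate (TNR)
```

Stratified folds preserve the 47/57 class proportions in every validation set, which matters for the sensitivity/specificity estimates with a mildly imbalanced sample such as this one.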

Classification performance using the lagged cross-correlation metric
In Table 1, we present the best classification accuracy, along with the corresponding sensitivity and specificity rates, for each manifold learning algorithm, threshold and classifier when using the lagged cross-correlation metric. At the end of Table 1, we also provide the results obtained with the "conventional" (thresholded) cross-correlation matrix. Fig. 1 shows the super-distribution of the sum of weights of all subjects against different values of σ used for the construction of the FCN with the Diffusion Maps algorithm; the red dotted vertical line shows the optimal σ (here, σ = 0.325), while the black vertical lines bound the linear region (σ ∈ (0.25, 0.35)). The results were robust to different choices of the time step of the Diffusion Maps algorithm, namely t = 0, 1, 2. The investigation of the optimal σ was performed with respect to the best classification accuracy over all classifiers. Fig. 2 depicts the best classification rates for MDS, ISOMAP, Diffusion Maps and the lagged cross-correlation matrix for the different classifiers. Diffusion Maps provided the best classification accuracy (79.3% for SVM with an RBF kernel and 52% PT), thus appearing more robust over a wide range of PT. With respect to the maximum classification accuracy obtained by Diffusion Maps, results were robust over a wide range of values of σ ∈ (0.28, 0.35). Isomap performed better for low thresholds. Overall, the performance of Isomap was more or less the same as that of the cross-correlation matrix, with the best classification rate being slightly higher for Isomap (74.4% for SVM with an RBF kernel and 24% PT). Its performance was, however, sensitive to the choice of the number k of nearest neighbors: with k = 5 we obtained a 74.4% classification accuracy (SVM with an RBF kernel, 24% PT), while for k = 6 we

Table 1: Best classification rates over all manifold and machine learning methods using the lagged cross-correlation pseudo-distance measure d_c (see 2.3.1); the embedded dimension (Emb. Dim) is also shown for each method, along with the corresponding best threshold point (Thres), Classifier, Accuracy (Acc), Sensitivity (Sens) and Specificity (Spec) rates.
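The proportional threshold (PT) used throughout keeps only the strongest fraction of edges and binarizes the network. A minimal NumPy sketch of this step, assuming the FCN is a symmetric weighted matrix (the function name and the tie-handling convention are ours, not necessarily those of the study):

```python
import numpy as np

def proportional_threshold(w, p):
    """Binarize a symmetric weighted connectivity matrix, keeping the
    fraction p of the strongest off-diagonal edges (PT). Ties at the
    cutoff weight may keep slightly more edges than requested."""
    n = w.shape[0]
    iu = np.triu_indices(n, k=1)            # upper-triangular edge list
    weights = w[iu]
    k = max(1, int(round(p * weights.size)))
    cutoff = np.sort(weights)[-k]           # weight of the k-th strongest edge
    keep = np.zeros_like(w, dtype=bool)
    keep[iu] = w[iu] >= cutoff
    return (keep | keep.T).astype(int)      # symmetric binary adjacency
```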


Classification performance using the Euclidean distance
The same analysis was performed for the Euclidean distance. The best classification rates using the Euclidean distance for all manifold learning methods and classifiers are presented in Table 2.
Overall, the classification rates with the Euclidean distance were noisier compared to the ones computed with the lagged cross-correlation distance.
For a wide range of PT, the classification rates obtained were under 60% and sometimes at the level of a random or a completely biased classifier (biased towards the larger class, here, the healthy controls). In our case, a random classifier resulted in a 50.5% classification accuracy, while, under a strict majority rule for imbalanced datasets, a completely biased (and thus useless) classification model results in a 54.8% classification rate.
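These two chance-level baselines follow from the class imbalance alone: a random classifier that guesses each class in proportion to its prevalence achieves p² + (1 − p)², while the majority classifier achieves p, where p is the fraction of the larger class. A short sketch reproducing the numbers quoted above:

```python
def baseline_accuracies(majority_fraction):
    """Chance-level accuracies for an imbalanced binary dataset.
    'random' guesses each class in proportion to its prevalence;
    'majority' always predicts the larger class."""
    p = majority_fraction
    random_acc = p * p + (1 - p) * (1 - p)
    majority_acc = p
    return random_acc, majority_acc

# With the healthy controls as the larger class at 54.8% of subjects:
rand_acc, maj_acc = baseline_accuracies(0.548)
print(round(rand_acc, 3), maj_acc)  # ~0.505 and 0.548, matching the baselines above
```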
Fig. 5 depicts the accuracy of all methods across all thresholds using all different classifiers.
In this case, the best classification rate was obtained with Isomap (72.9% for RSVM and 68% PT). Despite the fact that for certain thresholds the maximum classification rates obtained with the Euclidean distance were fairly high, the performance of the methods was generally poorer compared to the ones computed with the lagged cross-correlation distance over the whole range of PT.
Characteristic eigenspectra for all manifold learning algorithms utilizing the Euclidean distance are shown in Fig. 6.
Finally, Fig. 7 shows boxplots of the classification rates for the two metrics. Overall, the lagged cross-correlation distance resulted in higher classification rates compared to the Euclidean distance, with the exception of MDS. Apart from some specific thresholds, MDS performed poorly with both metrics.

DISCUSSION
In this study, we constructed embedded FCN from rsfMRI data using linear and nonlinear manifold learning techniques. Based on the graph-theoretical measures of the constructed FCN, we then used machine learning algorithms for classification purposes. We also compared the performance of two widely used metrics, namely the lagged cross-correlation and the Euclidean metric. For our illustrations, we used a publicly available dataset of resting-state fMRI recordings taken from healthy controls and patients with schizophrenia. This is the first study that performs such a systematic comparative analysis between various manifold learning algorithms, machine learning algorithms and metrics. To the best of our knowledge, it is also the first study that shows how Diffusion Maps can be exploited to construct FCN from fMRI data. At this point, we should note that our intention was not to obtain the best possible classification performance by "optimising" the pre-processing of the raw fMRI data and/or by trying to find the best set of graph-theoretical measures.
Instead, we used a very basic pre-processing of the raw data (as performed in [48]), while classification was based only on three key graph-theoretical measures in neuroscience [80], namely, the average path length, the global clustering coefficient and the median degree of the embedded binary networks. Based on the above, our best reported accuracy was 79.3%, obtained with Diffusion Maps and the lagged cross-correlation (using 10-fold cross-validation repeated 100 times).
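The three graph-theoretic features can be computed directly on the binary adjacency matrix. A self-contained Python sketch (the study used R subroutines; the function below assumes a connected, undirected binary network):

```python
from collections import deque
from statistics import median

def graph_measures(adj):
    """Average shortest-path length, global clustering coefficient
    (transitivity) and median degree of a binary undirected network,
    given as an adjacency matrix (list of lists of 0/1)."""
    n = len(adj)
    nbrs = [[j for j in range(n) if adj[i][j]] for i in range(n)]
    degrees = [len(nb) for nb in nbrs]

    # Average shortest-path length via BFS from every node.
    total, pairs = 0, 0
    for s in range(n):
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in nbrs[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    avg_path = total / pairs

    # Transitivity: 3 * (number of triangles) / (number of connected triples).
    triangles = sum(adj[i][j] and adj[j][k] and adj[i][k]
                    for i in range(n)
                    for j in range(i + 1, n)
                    for k in range(j + 1, n))
    triples = sum(d * (d - 1) // 2 for d in degrees)
    clustering = 3 * triangles / triples if triples else 0.0

    return avg_path, clustering, median(degrees)
```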
For the same benchmark fMRI dataset, Anderson and Cohen [48] used ISOMAP for the construction of embedded low-dimensional FCN for the classification between controls and schizophrenia patients. ROIs were acquired using single-subject ICA, and functional connectivity was assessed using the lagged cross-correlation distance. Their analysis revealed differences in small-world properties among groups, and 13 graph-theoretic features led to a reported 65% accuracy rate. Algunaid et al. [68] reported a 95% accuracy (with the FSV feature selection method, and 82.5% with a Welch's t-test), testing more than 360 graph-based features. AAL was used for signal extraction and ROI selection, while an SVM-based classifier was used along with a 10-fold cross-validation scheme to evaluate its performance. An important outcome of our analysis is that Diffusion Maps yield more robust results and higher classification accuracy, and that the lagged cross-correlation distance outperforms the Euclidean distance, which is traditionally used to construct connectivity matrices from fMRI data. In future work, we aim at analysing the local graph-theoretical properties, thus matching the derived RAICAR ICs with known resting-state networks, and also at performing a group ICA analysis.
For our analysis, we used the "R" software subroutines as described in 2 and 7.

APPENDIX
For classification we used Linear Support Vector Machines (LSVM), Radial (Radial basis function kernel) Support Vector Machines (RSVM), one-hidden-layer Artificial Neural Networks (ANN) and the k-NN classifier (k-NN). All classifiers were trained and evaluated via a 10-fold cross-validation scheme repeated 100 times. For the classification, we used the three key graph-theoretical measures described in subsection 2.5 of Materials and Methods. Training and classification were implemented using algorithms contained in the package "caret" [83], publicly available in the R free software environment [69].

Support Vector Machines (SVM)
Support vector machines (SVM) aim at finding the optimal separating plane or hyperplane in the feature space among groups. In particular, for a set of points (x_i, y_i), i = 1, 2, ..., N, where N is the number of subjects, x_i ∈ R^d contains the d attributes/features selected for subject i and y_i ∈ {−1, 1} is the subject's class (here, either healthy or patient), SVM attempts to find the optimal plane or hyperplane that divides the two classes by maximizing the margin of separation.
Any hyperplane through a given set of points x_i can be modelled as w · x_i + b = 0, where w contains the weights of the features x_i. Parallel hyperplanes can be described as w · x_i + b ≥ 1 if y_i = 1 and w · x_i + b ≤ −1 if y_i = −1. The optimization problem then aims at maximizing the margin 2/‖w‖ between the hyperplanes such that for every i = 1, 2, ..., N, y_i (w · x_i + b) ≥ 1. One can introduce a regularization parameter C indicating the penalty of the error z_i, which gives a trade-off between misclassifications and the width of the separating margin. This leads to the final optimization problem, which minimizes ‖w‖²/2 + C Σ_i z_i subject to y_i (w · x_i + b) ≥ 1 − z_i, i = 1, 2, ..., N. Based on the idea that the data may be better separable in a higher-dimensional space, SVM may utilize a kernel function to map x_i ∈ R^d to φ(x_i) ∈ R^D, D > d. In our study, besides the standard Linear SVM (LSVM), we also used Radial SVM (RSVM), making use of the Radial basis function kernel K(x_i, x_j) = exp(−γ ‖x_i − x_j‖²), where γ is the kernel's scale parameter.
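The RBF kernel and the resulting kernel-SVM decision rule sign(Σ_i α_i y_i K(x_i, x) + b) can be sketched as follows; the support vectors, multipliers α_i and bias b below are hypothetical placeholders (in practice they come from solving the optimization problem above, which the study did via caret in R):

```python
import math

def rbf_kernel(x, y, gamma):
    """Radial basis function kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def svm_decision(x, support_vectors, alphas, labels, b, gamma):
    """Kernel-SVM decision rule: sign(sum_i alpha_i * y_i * K(x_i, x) + b)."""
    s = sum(a * y * rbf_kernel(sv, x, gamma)
            for sv, a, y in zip(support_vectors, alphas, labels)) + b
    return 1 if s >= 0 else -1
```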

k-Nearest Neighbours classifier(k-NN)
The k-nearest neighbours algorithm is one of the simplest classification/machine learning algorithms. Given (x_i, y_i), i = 1, 2, ..., N, where N is the number of subjects, x_i ∈ R^d contains the d attributes/features selected for subject i and y_i is the subject's class (here, either healthy or patient), k-NN uses the Euclidean distance in the feature space to perform a vote among the κ closest neighbours. In this manner, each point is classified as "control" if the number of "control" neighbours is greater than the number of "patient" neighbours, and vice versa. The number κ of closest neighbours is a parameter of choice that plays a crucial role in the method's performance. In this study, it is important to note that we chose odd values of κ (i.e. how many neighbours we take into consideration) in order to avoid ties in the voting among neighbours.
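The voting scheme above can be sketched in a few lines of Python (the study used caret's k-NN implementation in R; the training points below are hypothetical):

```python
import math
from collections import Counter

def knn_predict(x, train, k=3):
    """Classify x by majority vote among its k nearest training points
    (Euclidean distance in feature space). `train` is a list of
    (features, label) pairs; an odd k avoids ties in binary problems."""
    by_dist = sorted(train, key=lambda p: math.dist(x, p[0]))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]
```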

Artificial Neural Networks (ANN)
In this study, we also used feed-forward Artificial Neural Networks (ANN) consisting of one hidden layer. There were three input units, matching the number of features considered for classification. We chose one hidden layer consisting of 1 to 5 neurons, along with a bias term. The activation function used for all neurons was the logistic transfer function [84]. The output was one node (reflecting the simple binary classification control/patient). The training of the model was done via back-propagation [85] using a 10-fold cross-validation scheme repeated 100 times. Finally, a weight decay parameter a (regularization parameter) was used to prevent over-fitting and improve the generalization [86] of the final model. For the implementation of the ANN we used the "nnet" software package [87], publicly available in the R free software environment [69].
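The forward pass of such a one-hidden-layer network with logistic activations can be sketched as follows (a minimal illustration only; the study trained the weights with back-propagation via "nnet" in R, which is not reproduced here):

```python
import math

def logistic(z):
    """Logistic transfer function, 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def ann_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a one-hidden-layer feed-forward network:
    3 input features -> hidden neurons (logistic) -> 1 output node
    (logistic, read as control/patient probability)."""
    hidden = [logistic(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    return logistic(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
```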

Fig. 3
Fig. 3 depicts the classification performance obtained with Diffusion Maps for different values of σ.

Figure 1 :
Figure 1: Super-distribution, over all subjects, of the sum of the weights (see the equation in 12) (e.g. by taking the median value of the distribution across subjects per value of σ) for different values of the parameter σ. The red dashed vertical line shows the optimal σ, which was found to be σ = 0.325. The two vertical black lines depict the linear zone in which we investigated values of σ.

Figure 2 :
Figure 2: Overall classification performance for all thresholds (from 20% to 70% of the strongest edges, in steps of 2%) and classifiers. The metric used is the lagged cross-correlation based pseudo-distance measure d_c (see 2.3.1).

Figure 3 :
Figure 3: Classification performance of Diffusion Maps for different values of σ and classifiers. The metric used is the pseudo-distance measure d_c (see 2.3.1), based on the lagged cross-correlation.

Fig. 4
Fig. 4 shows characteristic eigenspectra of the final decomposition for MDS, ISOMAP and Diffusion Maps. As shown, there are three gaps: the first between the first eigenvalue and the rest of the spectrum, the second between the first two eigenvalues and the rest of the spectrum, and a third between the first four to five eigenvalues and the rest of the spectrum.

Figure 4 :
Figure 4: Mean differences of the 15 largest eigenvalues (see 2.4.4) for all manifold learning algorithms with the cross-correlation metric: A) MDS (see 2.4.1), B) ISOMAP (see 2.4.2) and C) Diffusion Maps (see 2.4.3), based on the optimal parameters. The red dashed vertical line marks the maximum number of dimensions considered in this study (i.e. five dimensions).

Figure 5 :
Figure 5: Classification performance using the Euclidean distance (see 2.3.2) for all thresholds (from 20% to 70% of the strongest edges, in steps of 2%) and classifiers.

Figure 6 :
Figure 6: Mean differences of the 15 largest eigenvalues (see 2.4.4) for all manifold learning algorithms with the Euclidean metric: A) MDS (see 2.4.1), B) ISOMAP (see 2.4.2) and C) Diffusion Maps (see 2.4.3), using the optimal parameters. The red dashed vertical line marks the maximum number of dimensions considered in this study (i.e. five dimensions).

Figure 7 :
Figure 7: Boxplots of classification rates over all classifiers and thresholds, using A) the lagged cross-correlation pseudo-distance d_c (see 2.3.1) and B) the Euclidean distance L_2 (see 2.3.2). The labels at the bottom of each panel correspond to the method used for the construction of the FCN: the cross-correlation matrix (Corr. Matrix), Diffusion Maps (DMaps), ISOMAP and MDS. The black horizontal lines mark the median values of the distributions.

Table 2 :
Best classification rates over all manifold learning methods and classifiers using the Euclidean distance L_2 (see 2.3.2); the embedded dimension (Emb. Dim), threshold point (Thres), Classifier, Accuracy (Acc), Sensitivity (Sens) and Specificity (Spec) rates are also shown.