Introduction

Understanding how the structure of the brain influences cognitive scores such as IQ is vital for uncovering the working principles of the human brain. Cognitive scores are indicators of intellectual capacity and have been found to be strongly connected to social factors: high correlations between intelligence scores measured in childhood and later educational success were observed in (Colom et al., 2007; Deary et al., 2007), and the scores were also linked to health and mortality (Gottfredson & Deary, 2004; Batty et al., 2007). Motivated by this, many studies have investigated how far intelligence quotients (IQ) can be predicted from the structure of the brain. It was found, for example, that cerebral volume positively correlates with cognitive ability (Reiss et al., 1996; McDaniel, 2005). On a finer scale, activity and global connectivity of parts of the brain, especially of the lateral prefrontal cortex, are linked to IQ (Gray et al., 2003; Woolgar et al., 2010; Cole et al., 2012; Cole et al., 2015).

Against this background, recent works have explored the possibility of predicting cognitive ability scores from functional brain connectomes (Pamplona et al., 2015; Dubois et al., 2018; Dadi et al., 2019; Dryburgh et al., 2020; He et al., 2020; Jiang et al., 2020). Conventionally, connectomes are obtained from resting-state MRI and characterize the network structure of the brain; they are modeled as graphs whose nodes represent regions of interest (ROIs) and whose edges correspond to correlations in activity between these ROIs (Sporns et al., 2005). In order to achieve better generalizability across contexts and populations, (Shen et al., 2017) proposed a data-driven, cross-validated protocol for connectome-based predictive modeling of brain-behavior relationships that trains a linear regression model. Building upon it, (Dryburgh et al., 2020) improved the results by evaluating negative and positive correlations of brain regions separately. They performed their analysis on both neurotypical subjects and subjects with Autism Spectrum Disorder (ASD) in order to investigate how neural correlates of intelligence scores are altered by atypical neurodevelopmental disorders.

Although such works achieved significant success, they mainly relied on classical machine learning approaches, which do not incorporate the graph structure of the connectomes; therefore, the local and global topological properties of the connectomes are not leveraged. To address this, (He et al., 2020) brought graph neural networks (GNNs) (Wu et al., 2021) to this task. GNNs belong to geometric deep learning, a field where learning is customized to non-Euclidean spaces such as graphs with complex topologies (Dehmamy et al., 2019); they are deep neural networks with graph convolution layers. GNNs have already led to significant increases in performance over existing methods in many fields. For example, they have been successfully applied to classification tasks on networks (Kipf & Welling, 2017; Qu et al., 2019), image segmentation (Qi et al., 2017), feature matching (Sarlin et al., 2020), few-shot learning (Garcia & Bruna, 2017; Kim et al., 2019), and various graph mining tasks (Schlichtkrull et al., 2018; Yun et al., 2019; Zhang et al., 2019). A very recent review on GNNs in the field of network neuroscience (Bessadok et al., 2021) examined a variety of graph-based architectures tailored for brain connectivity classification, integration, super-resolution, and synthesis across time and modalities. However, none of the reviewed methods were designed for brain graph regression for cognitive score prediction.

In this paper, we propose the first GNN architecture, namely RegGNN, that is specialized in regressing brain connectomes against a target cognitive score. Our GNN utilizes graph convolutional layers to map input connectomes onto their corresponding cognitive scores, thereby allowing us to extract the learned weights and identify the brain connectivities between anatomical regions that fingerprint the target score.

To improve the performance of the GNN, we additionally propose a novel learning-based sample selection method. It is independent of RegGNN and can be used with any architecture or regression learner. The method identifies training samples with the highest predictive power (i.e., those that are most likely to predict unseen test subjects with the lowest error); only these are then used for training. Through this, we eliminate the samples that do not increase (or even decrease) the prediction success of the model and reduce the computational resources needed for training the GNN.

Within our sample selection method, we make use of the fact that the (weighted) adjacency matrix of a functional brain connectome, when modeled as a correlation matrix, is symmetric positive semi-definite and becomes symmetric positive definite (SPD) after a simple regularization step (Dodero et al., 2015; Wong et al., 2018; You & Park, 2021). The space of SPD matrices forms a nonlinear manifold (Arsigny et al., 2006), and like (You & Park, 2021), we use a Riemannian geometric structure on it in order to obtain a natural notion of distance between two connectomes as well as tangent matrices that encode the paths that realize this distance.

We summarize the main contributions of our work as follows:

  1. We introduce a novel, learning-based sample selection method for graph neural networks that helps to increase accuracy when predicting cognitive scores from connectomes.

  2. We propose novel similarity measures between brain connectomes by combining notions from Riemannian geometry and topology of graphs. These measures can be used in other applications whenever we deal with objects that can be interpreted as elements of Riemannian manifolds.

  3. We design a pipeline, consisting of RegGNN with sample selection, which outperforms state-of-the-art models in predicting full scale intelligence and verbal intelligence quotients from functional brain connectomes in an autism spectrum disorder cohort and achieves a competitive performance in a neurotypical cohort.

Methods

In this section, we detail the architecture of our RegGNN. Furthermore, we introduce our proposed sample selection method and show how we incorporate it into the training process of the GNN. To start with, we recount some facts on the Riemannian geometry of SPD matrices and recall graph-topological centrality measures. The mathematical notations that we use in the following are summarized in Table 1.

Table 1 Major mathematical notations used in this paper

Preliminaries

The space of n-by-n symmetric positive definite matrices \(\text{SPD}(n) = \{\mathbf{P} \in \mathbb{R}^{n,n}: \mathbf{P}^{T}=\mathbf{P}, \text{ all eigenvalues of } \mathbf{P} \text{ are positive}\}\) forms a cone-like manifold in the set of all matrices of the same size (Faraut & Korányi, 1994). Being a manifold, it has a well-defined tangent space at every point \(\mathbf{P} \in \text{SPD}(n)\), which we denote by \(T_{\mathbf{P}}\text{SPD}(n)\). It is a basic fact that each \(T_{\mathbf{P}}\text{SPD}(n)\) can be identified with the set of symmetric n-by-n matrices. Therefore, in order to avoid later confusion, we call their elements tangent matrices instead of "tangent vectors" (which is the standard term in differential geometry).

As a manifold, SPD(n) can be endowed with a Riemannian geometric structure (do Carmo, 1992). Such a structure is determined by the choice of a Riemannian metric, i.e., a smoothly varying inner product on the tangent spaces. With its help, we can measure angles between (and norms of) tangent matrices. Furthermore, it induces a distance d on the space. Consequently, geodesics can be defined as (locally) shortest paths. Like a straight line in Euclidean space, a geodesic γ that connects two points \(\mathbf{P}, \mathbf{Q} \in \text{SPD}(n)\) can be represented by a unique tangent matrix \(\log(\mathbf{P}, \mathbf{Q}) \in T_{\mathbf{P}}\text{SPD}(n)\) (Footnote 1). In particular, \(\log(\mathbf{P}, \mathbf{Q})\) points in the direction of Q, i.e., is parallel to γ at P, and its norm (measured in the norm induced by the metric) equals the distance between P and Q. Because of this, we can view \(\log(\mathbf{P}, \mathbf{Q})\) as the linearized "difference" between Q and P.

In contrast to Euclidean geometry, tangent matrices from different tangent spaces of a Riemannian manifold cannot be compared directly. Instead, they must be transported along curves to the same tangent space; this process is called parallel translation. This means that although tangent matrices at different points \(\mathbf{P} \in \text{SPD}(n)\) and \(\mathbf{Q} \in \text{SPD}(n)\) are symmetric matrices, we must bring them to a common point in order to compare them. The SPD space and the parallel translation of tangent matrices are illustrated in Fig. 1.

Fig. 1

Illustration of geodesics and parallel transport of tangent matrices on the SPD cone. The dashed lines are the geodesics between the matrices \(\mathbf{P}_{1}, \mathbf{P}_{2}, \mathbf{P}_{3} \in \text{SPD}(n)\). The tangent matrices \(\tilde{\mathbf{S}}_{1,2} := \log(\mathbf{P}_{1},\mathbf{P}_{2})\) and \(\tilde{\mathbf{S}}_{2,3} := \log(\mathbf{P}_{2},\mathbf{P}_{3})\) are the yellow and red arrows, respectively; their parallel translations to the tangent space \(T_{\mathbf{I}}\text{SPD}(n)\) at the identity matrix I are \(\mathbf{S}_{1,2}\) and \(\mathbf{S}_{2,3}\)

Since all notions depend on the Riemannian structure, we must fix one. For SPD(n), several can be found in the literature, the most popular being the Log-Euclidean metric (Arsigny et al., 2006) and the affine-invariant metric (Moakher, 2005; Pennec et al., 2006). They have been applied successfully to connectomes for classification (Dodero et al., 2015; Yamin et al., 2020), regression (Wong et al., 2018), fingerprint extraction (Abbas et al., 2021), and statistical analysis (You & Park, 2021). We choose to work with the Log-Euclidean metric because it allows for comparatively efficient algorithms. Furthermore, under this metric parallel transport does not depend on the chosen path, and a unique length-minimizing geodesic exists between any two points; neither property holds for most other metrics.
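
To make these notions concrete, the following minimal NumPy sketch computes the quantities we use under the Log-Euclidean metric; it represents a tangent matrix transported to the identity by a difference of matrix logarithms, the standard identification in the Log-Euclidean framework. In our experiments we use the Morphomatics SPD class instead (see the Software paragraph); the function names below are illustrative only.

```python
import numpy as np

def spd_log(P):
    """Matrix logarithm of an SPD matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(P)            # eigenvalues w are positive for SPD input
    return (V * np.log(w)) @ V.T

def log_euclidean_distance(P, Q):
    """Geodesic distance between P and Q under the Log-Euclidean metric."""
    return np.linalg.norm(spd_log(P) - spd_log(Q), ord="fro")

def tangent_matrix_at_identity(P, Q):
    """Tangent matrix encoding the geodesic from P to Q, parallel translated to
    the tangent space at the identity (Log-Euclidean transport is path-independent)."""
    return spd_log(Q) - spd_log(P)
```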

We now recall three basic topological centrality measures for an undirected graph G (Footnote 2): degree centrality, eigenvector centrality, and closeness centrality; their definitions are given in the Appendix (Topological centrality measures). They measure how central a node is to the (graph) network in the sense that much of the network's communication passes through it. A good reference on this is the book by Fornito et al. (2016).

We are now ready to introduce the graph neural network, and afterwards, the sample selection process.

RegGNN

Our GNN for regression, RegGNN, consists of two graph convolution layers and a downstream fully connected layer; a visualization is shown on the bottom left of Fig. 2. In the following, we denote the number of ROIs by d. Since adjacency matrices of connectomes are d-by-d correlation matrices C, they can have zero (but no negative) eigenvalues. Therefore, we can simply regularize them to be symmetric positive definite by adding a small multiple of the identity matrix I, i.e.,

$$ \mathbf{P} := \mathbf{C} + \mu \mathbf{I} $$
(1)

for some small μ > 0; see (Dodero et al., 2015) or (Wong et al., 2018). For training RegGNN (but not for the sample selection), we additionally set all negative correlations to zero, as positive correlations have been shown to be more important in brain network analysis (Fornito et al., 2016). Indeed, in our experiments the results improved when negative correlations were ignored. RegGNN receives the regularized, nonnegative adjacency matrix P of a connectome and predicts the corresponding IQ score from it by applying graph convolutions.
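
A minimal NumPy sketch of this preprocessing step (the function names are illustrative):

```python
import numpy as np

def regularize_connectome(C, mu=1e-10):
    """Shift a correlation matrix into SPD(d) as in Eq. (1): P = C + mu * I."""
    return C + mu * np.eye(C.shape[0])

def drop_negative_correlations(P):
    """Zero out negative entries; used for training RegGNN, but not for sample selection."""
    return np.clip(P, 0.0, None)

# Example for a 116 x 116 functional connectome C of Pearson correlations:
# P = drop_negative_correlations(regularize_connectome(C))
```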

Fig. 2

Illustration of the proposed sample selection strategy to train our regression graph neural network RegGNN. A) We split the data into training (green) and testing (violet) sets. B-i) The training set is divided into a train-in (yellow) and a holdout (red) set. We then extract tangent matrices for geodesics connecting elements from the train-in set (yellow) and tangent matrices encoding geodesics from elements of the train-in to elements of the holdout group (red). B-ii) The information from the tangent matrices is compressed into vectors through topological feature extraction. B-iii) With linear regression, we train a mapping f on the train-in-to-train-in feature vectors (yellow) to learn differences in target score and record for each element j of the train-in group the k elements from the holdout group for which the predicted difference in target score to j was smallest. B-iv) For each sample in the holdout set, we count how often it was among the top k predictors for a sample from the train-in group. C) After repeating B) in an N-fold cross-validation manner, the k samples (blue) with the highest accumulated top-k frequency are selected. After negative correlations have been set to zero, the graph neural network is trained (only) on these samples. Finally, the testing set is used to evaluate the overall performance.

In the literature, there are various implementations of graph convolutions, which mainly differ in their propagation rule. Let \(\mathbf{H}^{(i)}\) denote the activation matrix at the i-th layer for i = 0, 1, 2. It is obtained from the previous layer according to the general rule \(\mathbf{H}^{(i)} = g_{i}(\mathbf{H}^{(i-1)}, \mathbf{P})\) with functions \(g_{i}: \mathbb{R}^{d,d_{i-1}} \times \text{SPD}(d) \to \mathbb{R}^{d,d_{i}}\) for i = 1, 2. As initialization we choose \(\mathbf{H}^{(0)} := \mathbf{I}\). Furthermore, we choose the \(g_{i}\) as proposed by (Kipf & Welling, 2017). Defining \(\tilde{\mathbf{P}} := \mathbf{P} + \mathbf{I}\) and letting \(\tilde{\mathbf{D}}\) be the diagonal degree matrix of \(\tilde{\mathbf{P}}\), we formalize \(g_{i}\) as follows:

$$g_{i}(\mathbf{H}^{(i)},\mathbf{P}) := \text{ReLU} (\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{P}} \tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{H}^{(i)} \mathbf{W}^{(i)}),$$

where \(\mathbf{W}^{(i)} \in \mathbb{R}^{d_{i-1},d_{i}}\) is the learnable weight matrix; we chose \(d_{0} := d = 116\), \(d_{1} := 64\), and \(d_{2} := 1\). We thus use the graph convolution layers to reduce the size of the connectomes and obtain an embedding of the brain graphs into \(\mathbb{R}^{d}\). We apply a dropout layer after the first graph convolution operation for regularization. Finally, the obtained embedding passes through a fully connected (linear) layer which produces a continuous scalar output. The goal of this layer is to map the vector of d features to a scalar value representing the predicted IQ score.
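
The following compact PyTorch sketch summarizes the forward pass just described. It is a simplified re-implementation for illustration only (our released code builds on PyTorch Geometric, see the Software paragraph), and the class and attribute names are not those of the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegGNNSketch(nn.Module):
    def __init__(self, d=116, hidden=64, dropout=0.1):
        super().__init__()
        self.W1 = nn.Linear(d, hidden, bias=False)   # weight matrix W^(1)
        self.W2 = nn.Linear(hidden, 1, bias=False)   # weight matrix W^(2)
        self.fc = nn.Linear(d, 1)                    # final fully connected layer
        self.dropout = nn.Dropout(dropout)

    def forward(self, P):
        # P: regularized d x d adjacency matrix (float tensor) with negative correlations removed
        d = P.shape[0]
        P_tilde = P + torch.eye(d)
        deg_inv_sqrt = torch.diag(P_tilde.sum(dim=1).pow(-0.5))
        A_hat = deg_inv_sqrt @ P_tilde @ deg_inv_sqrt   # normalized propagation matrix
        H = torch.eye(d)                                # H^(0) = I
        H = F.relu(A_hat @ self.W1(H))                  # first graph convolution (d x 64)
        H = self.dropout(H)
        H = F.relu(A_hat @ self.W2(H))                  # second graph convolution (d x 1)
        return self.fc(H.view(-1))                      # scalar prediction of the IQ score
```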

Learning-based sample selection

We now introduce our learning-based sample selection strategy. The underlying idea is the following. Imagine the (rather extreme) case that our subjects are clustered (possibly with outliers) into k tight groups according to their cognitive scores. Then, training a GNN on k representatives, one from each cluster, should yield good results; it should even perform better than a GNN trained on the full data set because it was not "distracted" by outliers during training. Ideally, as representative we would choose the most central sample of each group, i.e., the one with the smallest average difference in cognitive score to the other samples. Now, since we want to predict cognitive scores from connectomes, we do not know these differences beforehand. On the other hand, existing studies have validated the relationship between brain connectivity patterns and behavior and cognition. For instance, recent papers (Pamplona et al., 2015; Shen et al., 2017; Dubois et al., 2018; Dadi et al., 2019; Dryburgh et al., 2020; Jiang et al., 2020; He et al., 2020) have shown that the cognitive ability of a person can be predicted quite accurately from the human connectome, indicating that brain cognition and behavior are encoded in its connectivity to a measurable degree. Such predictions would be elusive if similar inputs (here brain connectomes) could not be mapped to similar outputs (here cognitive scores). Consequently, we assume that similar brain connectivity networks correlate in cognition, whereas brain connectomes that vary in topological patterns might elicit different cognitive scores. Such a hypothesis might seem somewhat reductionist, as there are many other factors that contribute to molding and predicting brain cognition, such as genetics and epigenetics (Goldberg & Weinberger, 2004; Deary et al., 2006; Reichenberg et al., 2009). However, such factors remain out of the scope of this study. Therefore, our idea is to use the differences between the connectomes to learn the differences between the target scores in order to identify those "representatives". Our experiments below show that this idea, representing predicted local aggregations of data by a few representatives and training only on them, generalizes well to real data.

Implementing the idea, we represent differences between connectomes by tangent matrices and assume that the difference in IQ between two subjects depends linearly on (notions deduced from) the tangent matrix \(\log (\text {\textbf {P}}, \text {\textbf {Q}})\) that encodes the geodesic between the corresponding connectomes P,Q ∈SPD(d). This model is flexible, but at the same time allows for fast computations. Our sample selection method learns this linear map, which we call f in the following, via regression and uses it to identify the k samples with the lowest predicted average difference in target score to all other samples. As motivated above, we assume that they are representative of the whole set but do not contain (most of) the outliers that hinder successful training of the GNN. The structure and terminology of our method are inspired by the work of (Errica et al., 2019).

The sample selection method consists of four steps, which are visualized in part B of Fig. 2. Given a connectome data set, these are repeated in a nested N-fold cross-validation manner to make our selection of samples more robust. In each fold, we split the training set into two groups: a training subset, which we call the train-in group, and a validation subset, which we call the holdout group; we perform different train-in and holdout splits so that each sample from the training set is in the train-in group exactly N − 1 times. We denote the (constant) sizes of the train-in and holdout sets by \(n_{s}\) and \(n_{h}\), respectively.
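
The splitting scheme can be sketched with scikit-learn's KFold as follows; the data, the number of folds, and the variable names are illustrative. In each fold, the held-out indices form the holdout group and the remaining indices form the train-in group.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder training set: regularized 116 x 116 connectomes (illustrative data).
training_connectomes = np.random.rand(30, 116, 116)

N = 3  # number of folds
for train_in_idx, holdout_idx in KFold(n_splits=N, shuffle=True, random_state=0).split(training_connectomes):
    # train-in group: training_connectomes[train_in_idx]  (size n_s)
    # holdout group:  training_connectomes[holdout_idx]   (size n_h)
    pass  # steps i)-iv) below are carried out on this split
```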

i) Riemannian tangent matrix derivation. For each pair of regularized connectomes \(\mathbf{P}^{s}_{i}\), \(\mathbf{P}^{s}_{j} \in \text{SPD}(d)\) in the train-in group, we compute the tangent matrix

$$\tilde{\mathbf{S}}^{s,s}_{i,j} := \log(\mathbf{P}^{s}_{i}, \mathbf{P}^{s}_{j}) \in T_{\mathbf{P}^{s}_{i}}\text{SPD}(d)$$

that encodes the geodesic between them and parallel translate it to \(T_{\mathbf{I}}\text{SPD}(d)\); we denote the resulting symmetric d-by-d matrix by \(\mathbf{S}^{s,s}_{i,j}\). As a result, we obtain a set of \(n_{s}(n_{s}-1)/2\) tangent matrices in \(T_{\mathbf{I}}\text{SPD}(d)\) that represent the pairwise differences between the connectomes from the train-in group. Analogously, we get a tangent matrix \(\mathbf{S}^{s,h}_{j,l} \in T_{\mathbf{I}}\text{SPD}(d)\) for each pair with one sample \(\mathbf{P}^{s}_{j}\) from the train-in and one sample \(\mathbf{P}^{h}_{l}\) from the holdout group; this results in another set consisting of \(n_{s}n_{h}\) tangent matrices. The latter are the outgoing "difference matrices" from the train-in into the holdout set.

ii) Topological feature extraction from tangent matrices. The tangent matrices are still rather high dimensional, which leads to long computation times. Thus, we suggest extracting topological features in order to encode the information in a more compact form. We select degree, closeness, and eigenvector centrality, as well as combinations of them, as our candidates for feature extraction. Note that in our case a tangent matrix represents the "difference" between two connectomes; the above features thus encode information on linearized changes in node connectivity. To the best of our knowledge, this is the first time that these notions have been used in conjunction. As a result, from all in-group tangent matrices \(\mathbf{S}^{s,s}_{i,j}\) as well as outgoing tangent matrices \(\mathbf{S}^{s,h}_{j,l}\) we obtain feature vectors \(v^{s,s}_{i,j}\) and \(v^{s,h}_{j,l}\), respectively.
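
As an illustration of this step, the sketch below extracts one of the centrality vectors from a tangent matrix with NetworkX. How the signed tangent matrices are turned into weighted graphs is detailed in the Appendix; here, purely as an assumption for the sketch, absolute values are used as edge weights and the diagonal is ignored.

```python
import numpy as np
import networkx as nx

def centrality_features(S, measure="dc"):
    """Compress a d x d tangent matrix S into a d-dimensional feature vector."""
    W = np.abs(S)                          # assumption: absolute values as edge weights
    np.fill_diagonal(W, 0.0)               # ignore self-loops
    G = nx.from_numpy_array(W)             # weighted graph on the d ROIs
    if measure == "dc":                    # (weighted) degree centrality
        c = dict(G.degree(weight="weight"))
    elif measure == "ec":                  # eigenvector centrality
        c = nx.eigenvector_centrality_numpy(G, weight="weight")
    elif measure == "cc":                  # closeness centrality (weights read as distances)
        c = nx.closeness_centrality(G, distance="weight")
    else:
        raise ValueError(measure)
    return np.array([c[i] for i in range(S.shape[0])])
```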

iii) Learning a linear regression mapping for predictive sample selection. We learn the linear map f via regression by training it to map the vectors \(v^{s,s}_{i,j}\), corresponding to samples i and j from the train-in group, to the absolute difference in target score \(|IQ^{s}_{j} - IQ^{s}_{i}|\) between them. We then apply the learned mapping f to the vectors \(v^{s,h}_{j,l}\) to predict the differences in target score between all samples j from the train-in and samples l from the holdout group.

iv) Frequency map. We record for each holdout sample \(\mathbf{P}^{h}_{l}\) the k subjects from the train-in group with the smallest predicted difference under f and increment a frequency map (i.e., a counter) that is initialized at the start of the sample selection process. The frequency value of a subject is then the number of times it was one of the top k predictive samples. These frequencies give an approximate ranking whereby the top samples are closest, in (predicted) target score, to the largest number of other samples.

After the cross-validation is finished, we extract the top k samples (Footnote 3) with the highest cumulative frequencies. We expect these samples to have the highest representative power, as they consistently predicted samples in different holdout groups with low error.
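
For concreteness, steps iii) and iv) can be sketched as follows for a single fold; the variable and function names are illustrative and do not match our released code. Here, X_ss holds the feature vectors \(v^{s,s}_{i,j}\) with targets \(|IQ^{s}_{j}-IQ^{s}_{i}|\), and X_sh[l] holds the feature vectors from every train-in sample to holdout sample l.

```python
from collections import Counter

import numpy as np
from sklearn.linear_model import LinearRegression

def update_frequency_map(X_ss, y_ss, X_sh, train_in_ids, k, frequency):
    """One fold of steps iii) and iv).

    X_ss: (n_pairs, m) train-in-to-train-in feature vectors, y_ss: absolute score differences.
    X_sh: (n_h, n_s, m) feature vectors from each train-in sample to each holdout sample.
    """
    f = LinearRegression().fit(X_ss, y_ss)           # step iii): learn the linear map f
    for features_to_holdout_l in X_sh:               # step iv): loop over holdout samples
        pred = f.predict(features_to_holdout_l)      # predicted score differences to all train-in samples
        top_k = np.argsort(pred)[:k]                 # k train-in samples with the smallest predicted difference
        frequency.update(train_in_ids[j] for j in top_k)

# frequency = Counter() is shared across all folds; after cross-validation, the k
# training samples with the highest cumulative counts are selected:
# selected_ids = [sid for sid, _ in frequency.most_common(k)]
```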

Training process

In the following, we explain how we integrate the sample selection method into the training process of RegGNN. The whole pipeline is shown in Fig. 2.

Given the data set of connectomes, our proposed training pipeline consists of the following steps A-C.

A- Training-test split First, we split the data set into a training and a test set. The test set is used only for the final evaluation of RegGNN.

B- Learning-based sample selection Then, we select the top k samples with the highest representative power from the training set by applying the sample selection method from Section Learning-based sample selection.

C- RegGNN architecture for regression Finally, we train RegGNN on the top k samples using cross-validation to evaluate model generalizability against perturbations of training and testing data distributions. The final testing is done on the unseen test set.

Data and methodology

We used the pipeline from Section Training process to predict the full scale intelligence quotient (FIQ) and the verbal intelligence quotient (VIQ) from brain connectomes for both neurotypical (NT) subjects and subjects with autism spectrum disorder (ASD). In the following, we summarize these experiments.

Dataset We used samples from the Autism Brain Imaging Data Exchange (ABIDE) Preprocessed dataset (Craddock et al., 2013) for our experiments. It contains data from 16 imaging sites, preprocessed by five different teams using four pipelines: the Connectome Computation System (CCS), the Configurable Pipeline for the Analysis of Connectomes (CPAC), the Data Processing Assistant for rs-fMRI (DPARSF), and the NeuroImaging Analysis Kit (NIAK). The preprocessed data sets are available online (Footnote 4). To account for possible biases due to differences between sites, we used randomly sampled subsets of the available data for both cohorts; the same sets were also used by (Dryburgh et al., 2020). The NT cohort consisted of 226 subjects (mean age 15 ± 3.6 years), while the ASD cohort was made up of 202 subjects (mean age 15.4 ± 3.8 years). FIQ and VIQ scores in the NT cohort have means 111.573 ± 12.056 and 112.787 ± 12.018, whereas FIQ and VIQ scores in the ASD cohort have means 106.102 ± 15.045 and 103.005 ± 16.874, respectively. The brain connectomes were obtained from resting-state functional magnetic resonance imaging using the parcellation from (Tzourio-Mazoyer et al., 2002) with 116 ROIs. The functional connectomes are represented by 116-by-116 matrices whose entry in row i and column j is the Pearson correlation between the average rs-fMRI signals measured in ROIs i and j.

Software All experiments are done in Python 3.7.10. We used Scikit-learn 0.24.2 (Pedregosa et al., 2011) for machine learning models and PyTorch Geometric 1.6.3 (Fey & Lenssen, 2019) for graph neural network implementations. For Riemannian geometric computations in the SPD space, we used the SPD class from the Morphomatics package of (Ambellan et al., 2021). To extract the graph topological features from the tangent matrices we used NetworkX (Hagberg et al., 2008).

Parameter settings We trained our method with the Adam optimizer (Kingma & Ba, 2017) for 100 epochs with a learning rate of 0.001 and a weight decay of 0.0005, based on our empirical observations. The dropout rate after the first graph convolutional layer was set to 0.1. To regularize the adjacency matrices, we used \(\mu = 10^{-10}\) in (1). In order to explore the parameter space for the number of selected training samples k, we varied it between 2 and 15.

Evaluation and comparison methods To test the generalizability and robustness of our method, we used 3-fold cross-validation on the NT and ASD cohorts separately, for both FIQ and VIQ prediction. We report the mean absolute error (MAE) and the root mean squared error (RMSE) for all methods. For the sample selection methods, we additionally give the mean, standard deviation, minimum, and maximum over all tested \(k=2,\dots,15\) to assess our sample selection method's sensitivity to the choice of k.

To benchmark against our method, we chose state-of-the-art methods from both deep learning and machine learning. The first baseline was CPM (Shen et al., 2017), which was specifically designed for behavioral score prediction on brain connectomes; the second was PNA (Corso et al., 2020), which outperformed common GNNs on both artificial and real-world benchmark regression tasks (but had not yet been applied to brain connectomes). PNA comes with principal neighborhood aggregation layers that are defined similarly to graph convolution operations. They are designed to increase the amount of information that is used from the local neighborhoods in the graphs. In our experiments, we inserted PNA layers into our RegGNN architecture. We implemented both a simpler setup with sum aggregation and identity scaling only (denoted by PNA-S) and a setup with various aggregation (sum, mean, var, and max) and scaling (identity, amplification, and attenuation) methods (denoted by PNA-V), as detailed in the paper of (Corso et al., 2020). The code of both CPM (Footnote 5) and PNA (Footnote 6) is available online.

In order to assess the effect of the sample selection method, we also always trained each architecture on all samples as a baseline.

Evaluation of the sample selection For each architecture, we compared several methods that can be used as measures of difference in the sample selection (see step (ii) in Section Learning-based sample selection) to train the linear mapping f.

The first class of methods was the proposed one: we encoded the differences via tangent matrices in the SPD space. To identify a good choice for handling the information that is contained in the tangent matrices, we compared several methods. As one option, we trained f on the vectorized upper triangular part (including the diagonal) of the tangent matrix; this method is denoted by (tm). Since the matrices are symmetric, ignoring the lower part speeds up computations without losing information. Further, we used degree centrality, eigenvector centrality, and closeness centrality (see the Appendix: Topological centrality measures) and applied them to the tangent matrices; these methods are denoted by (dc), (ec), and (cc), respectively. Note that during this process, the topology of each connectome is not altered. The mapping f was then trained on the resulting centrality vectors. Additionally, we tested whether the concatenation of the above centrality measures into a single vector is even more informative. To this end, we used both an unscaled and a scaled version, denoted by (cnu) and (cns), respectively. The unscaled version was generated by simple concatenation of the three feature vectors. However, as the three centrality measures have different ranges, we additionally tested scaling each feature vector first. For this, we used min-max scaling. Recall that min-max scaling of a vector v is defined element-wise by

$$\widetilde{v_{i}} := \frac{v_{i} - \min(v)}{\max(v) - \min(v)}.$$

Each centrality vector was scaled before concatenating, which then gave a vector with elements in [0,1] as data for the regression.
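
A small NumPy sketch of the (cnu)/(cns) feature construction (illustrative function names):

```python
import numpy as np

def min_max_scale(v):
    """Element-wise min-max scaling of a vector to [0, 1]."""
    return (v - v.min()) / (v.max() - v.min())

def concatenated_centralities(dc, ec, cc, scaled=False):
    """Concatenate degree, eigenvector, and closeness centrality vectors:
    scaled=False gives (cnu), scaled=True gives (cns)."""
    vecs = (dc, ec, cc)
    if scaled:
        vecs = tuple(min_max_scale(v) for v in vecs)
    return np.concatenate(vecs)
```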

We complemented these methods with two baselines. In order to check whether the additional directional information contained in the tangent matrices helps, we also tested whether it suffices to train f on the Riemannian geometric distances \(d(\mathbf{P}^{s}_{i},\mathbf{P}^{s}_{j})\) between the connectomes from the train-in group alone; this method is denoted by (g). To assess whether we improve by using the manifold structure of the SPD space at all, we trained f on the Euclidean distances between the upper triangular parts \(\widehat{\mathbf{P}}^{s}_{i}, \widehat{\mathbf{P}}^{s}_{j}\) of each pair of connectomes \(\mathbf{P}^{s}_{i}, \mathbf{P}^{s}_{j}\), i.e., on the scalars \(\|\widehat{\mathbf{P}}^{s}_{i} - \widehat{\mathbf{P}}^{s}_{j}\|_{\text{F}}\) (F standing for the Frobenius norm); we denote this method by (a).

For all architectures, we report the p-value of a t-test comparing the MAE of the best performing sample selection method with the baseline MAE.

Results and Discussion

The results for the NT and ASD cohorts are shown in Tables 2 and 3, respectively. We observe that while the state-of-the-art machine learning model CPM surpasses naive applications of GNNs in the form of PNA, our RegGNN, paired with sample selection, outperforms CPM in all tasks according to both MAE and RMSE, with the exception of the NT (FIQ) task. Improvements by our method are especially visible in the ASD cohort. Interestingly, we see that the results of all methods are worse on the ASD cohort than on the NT cohort. This was also observed by (Dryburgh et al., 2020). We hypothesize that the difficulty of predicting IQ scores in the ASD cohort might be caused by the inter-subject heterogeneity that is characteristic of ASD (Tordjman et al., 2018). Another factor may be that ASD samples from ABIDE are biased towards high-functioning individuals (Craddock et al., 2013).

Table 2 Comparison of regression methods on the NT cohort
Table 3 Comparison of regression methods on the ASD cohort

We observe further that sample selection improved the results for RegGNN in all tasks except ASD (VIQ), and for CPM in the ASD cohort. For PNA-based architectures, there are drastic improvements in NT (FIQ) and ASD (VIQ) and incremental improvements in the remaining tasks. An exception to this is the PNA-V setup for ASD (FIQ), where most models with sample selection perform worse than the one trained on all samples. This might be partly explained by the more complicated structure of PNA with various aggregation methods, which might demand more samples for correct training.

For all models, we see that the minimum MAE over k is lower than the MAE of the model trained on the full data set, even in tasks where the mean MAE across k is higher. This indicates that improvements are highly likely with fine-tuning of the parameter k.

Our experiments did not reveal a clear trend for the value of k at which the minimum was attained. Nevertheless, our observations show that the proposed RegGNN network is more stable to changes in the parameter k. Calculating for each architecture the average of the standard deviations (std) of the mean absolute error (Footnote 7; see the results tables) over all feature extraction methods, we first note that the averages for RegGNN are 0.455, 0.145, 0.877, and 0.369 for NT (FIQ), NT (VIQ), ASD (FIQ), and ASD (VIQ), respectively. While RegGNN therefore shows small variation with respect to k, CPM is highly sensitive to changes of this parameter, with averages of 2.901, 2.613, 1.803, and 2.754, respectively. This is approximately a 2- to 10-fold increase in variability. This suggests that RegGNN better captures the brain graph structure, whereas CPM treats graphs as flattened vectors without preserving their topological features.

Improvements to the performance of CPM are not statistically significant (p = 0.87 for ASD (FIQ), p = 0.15 for ASD (VIQ)). Similarly, we observed that improvements to the performance of RegGNN are only statistically significant in NT (VIQ) task (p < 0.05 for NT (VIQ), p = 0.98 for NT (FIQ), p = 0.64 for ASD (FIQ)). The performance increases for PNA models are more consistent, as PNA-S improved significantly in three out of four tasks (p < 0.01 for NT (FIQ), p = 0.11 for NT (VIQ), p < 0.05 for ASD (FIQ), p < 0.01 for ASD (VIQ)), and PNA-V improved significantly in two out of four tasks (p < 0.05 for NT (FIQ), p = 0.21 for NT (VIQ), p = 0.11 for ASD (FIQ), p < 0.01 for ASD (VIQ)).

Within the sample selection pipelines, the best performing methods with respect to MAE always utilize the Riemannian geometric structure of the SPD space, apart from the PNA-V results for the NT (VIQ) task. In the majority of cases, the methods that rely on tangent matrices perform best, with the vectorized version of the whole tangent matrix being the best method in NT (VIQ) and ASD (FIQ). We also see that the three centrality measures and their concatenated versions perform well consistently. Our results do not reveal any finer pattern among the sample selection measures, but we can conclude that using the Riemannian structure of connectomes to estimate their predictive power outperforms methods that do not leverage these geometric properties. Thus, several of these geometry-aware measures should be considered when deciding on one. The tangent matrix method seems to perform very well but is also the most time consuming, since no dimension reduction is performed. On the contrary, computing centrality measures significantly reduces the size of the matrices, which speeds up the process considerably. In our experiments, we observed that training linear regression models using tangent matrices took up to 16 times longer than training models using centrality measures. To better understand the latter methods and how they work, it would be helpful to analyze their behavior on the tangent matrices mathematically. In contrast to their use for adjacency matrices of graphs, this is, to the best of our knowledge, unknown. It is thus an interesting avenue for future work.

An important advantage of using sample selection in training graph neural networks is the decrease in the computational power needed for the training process. Using computational power more efficiently leads to shorter training times on a fixed amount of data, which opens up opportunities to train more complex models on more data or in shorter amounts of time. While the exact time required for sample selection is heavily dependent on the hardware used and varies with the model architecture, the number of epochs in training, the number of training samples, and the number k of samples to select, our observations during the experiments show that sample selection reduces the training time by 20% on average. Therefore, our sample selection pipeline can enable the use of deeper neural network architectures on connectomes, which is a promising direction for future work.

So far, we have evaluated our method on a young population; however, our RegGNN demonstrated its generalizability through the employed cross-validation strategy and across both NT and ASD brain connectivity datasets. The proposed model can easily be used to map the brain connectivity of a particular population (e.g., an elderly population) to target scores to predict. To facilitate replication studies on other cohorts, we publicly shared our RegGNN source code (Footnote 8).

Explainability and biomarker discovery In order to identify the brain regions of interest that influence the prediction most, we extracted, for each of the four tasks, the learned weights of the RegGNN utilizing the best-performing sample selection method. The weights came from the fully connected layer that maps a 116-dimensional vector to the output score. (Remember that the input vector represents the learned embedding of the graph.) Thanks to the end-to-end network training and backpropagation, as well as our network design (which preserves the structure of the connectome in both the first and second layer), the learned weights in the final fully connected layer quantify the importance of the connectome's nodes for the target prediction task. Hence, a node with a higher weight in the fully connected layer is more influential for the prediction of the output score.
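
Reading off the ROI weights from a trained model amounts to inspecting the weight vector of the final linear layer. The sketch below illustrates this; the attribute names follow the illustrative RegGNN sketch given earlier, not necessarily the released code.

```python
import numpy as np

def top_rois(model, roi_names, top=3):
    """Return the ROIs with the highest learned weights in the final fully connected layer."""
    weights = model.fc.weight.detach().cpu().numpy().ravel()   # one weight per ROI (length 116)
    order = np.argsort(weights)[::-1]                           # sort ROIs by descending weight
    return [(roi_names[i], float(weights[i])) for i in order[:top]]
```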

In Fig. 3, we show the regions of interest with the 3 highest weights averaged over k = 2,…,15; the underlying parcellation is the AAL atlas (Tzourio-Mazoyer et al., 2002) (Footnote 9). For the FIQ prediction task in the NT cohort, we see that the left superior dorsal frontal gyrus (SFGdor.L), right superior frontal medial gyrus (SFGmed.R), and right cerebellum 6 (CRBL6.R) have the highest weights. For the VIQ prediction task in the same cohort, the left hippocampus (HIP.L), left Heschl gyrus (HES.L), and left cuneus (CUN.L) possess the highest weights. In the ASD cohort, the regions with the highest weights for FIQ prediction are the left insula (INS.L), left calcarine cortex (CAL.L), and right pallidum (PAL.R), while the highest weights for VIQ prediction belong to the left superior frontal medial gyrus (SFGmed.L), left middle occipital gyrus (MOG.L), and left cuneus (CUN.L).

Fig. 3

The top three most relevant brain regions for each task according to the learned weights extracted from the last layer of the trained RegGNN. The colors indicate the task. NT: neurotypical subjects. ASD: autism spectrum disorder subjects. FIQ: full scale intelligence quotient. VIQ: verbal intelligence quotient

According to our results, the most influential regions of interest in IQ prediction lie in the left hemisphere of the brain. Our findings are in line with other studies that found that the insula shows greater activity in various cognitive tasks (Critchley et al., 2000) and that changes in the surface area of the left cuneus correlate strongly with full IQ, especially in perceptual tasks, in young adults with very low birth weight (Skranes et al., 2013). Furthermore, we observe that the left cuneus was influential in predicting VIQ in both cohorts. Finally, in line with (Dryburgh et al., 2020), our experiments indicate that the middle frontal gyrus is a significant region in IQ prediction.

A highly interesting question for future work is why the sample selection method improves the prediction, i.e., why there seem to be clusters within the data that can be represented by central samples. This is a challenging question that most likely requires the development of new analytical tools. Nevertheless, we think that it will be worth the effort, as common structures and connections between these central samples could give us many more insights into the interplay between the connectivity structure of the brain and cognitive ability.

Conclusion

In this work, we applied RegGNN, a new graph neural network, to connectome data of neurotypical subjects and subjects with autism spectrum disorder to predict full scale and verbal intelligence quotients. We trained it using a novel sample selection method, which identifies samples within the training set that are expected to best predict the cognitive scores of new subjects. This enabled us to train the network with only 15 samples or fewer, while the testing performance was on par with or even better than state-of-the-art methods for cognitive score prediction from connectomes. Both the sample selection and RegGNN are easy to implement with open access software and can be used in clinical practice.