Introduction

Connectomes or braingraphs are compact and focused derivatives of the diffusion magnetic resonance images (MRIs) of the brain: their vertices are labeled by the anatomical areas, and two such vertices are connected by a weighted graph-edge, if a tractography workflow Besson et al. (2014) finds neural tracks between the areas, corresponded to the vertices. By focusing on the connections between cerebral areas instead of analyzing the whole MR image, we can make use of the rich and refined resources of graph theory, born with the famous article of Leonhard Euler on the problem of the Königsberg Bridges Euler (1741) in 1741.

Our research group earlier has prepared several undirected and directed braingraph sets (Kerepesi et al. 2016, 2017; Szalkai et al. 2015a, 2017a, 2019a) from the 500 Subjects Data Release McNab et al. (2013) of the Human Connectome Project (HCP). The resulting graphs were made available at the site https://braingraph.org, and were applied in several structural studies of the human brain (Szalkai et al. 2015b; Kerepesi et al. 2018a; Szalkai et al. 2019a; Kerepesi et al. March 2018b; Szalkai et al. Feb 2019b, 2018; Szalkai et al. 2017b; Szalkai et al. 2016; Fellner et al. 2019, 2020a, 2020b).

In the present contribution we describe a new braingraph set, computed from the 1200 Subjects Data Release of the Human Connectome Project McNab et al. (2013). The set contains 1064 connectomes, each in five resolutions, and each edge is weighted by three different weight functions. Our dataset may serve as a robust resource for the computational neuroscience community in the coming years.

Methods

The data source of the workflow is the 1200 Subjects Data Release of the Human Connectome Project (HCP) McNab et al. (2013), documented at the site https://www.humanconnectome.org/study/hcp-young-adult/document/1200-subjects-data-release. For the present study the “re-preprocessed” 3T diffusion data was applied, as was detailed at the HCP site.

The Connectome Mapper Tool Kit (CMTK) workflow Daducci et al. (2012) was utilized in the graph computation on the HCP data. For each subject, we have applied the segmentation and the parcellation steps only once, but the probabilistic tractography part of the workflow 10 times. The parcellation scheme was the Lausanne2008 atlas, the labels applied are listed in https://github.com/LTS5/cmp_nipype/blob/master/cmtklib/data/parcellation/lausanne2008/ParcellationLausanne2008.xls.

The graph construction was performed in the following steps:

  1. 1.

    For each subject the MRtrix 0.3 tractography algorithm Tournier et al. (2012) was run, with probabilistic seeding and probabilistic tractography. The number of streamlines was set to 1 million. For defining the graph edges, let us consider two distinct, anatomically labeled areas of the cortical- or sub-cortical gray areas of the brain, denoted by A and B. If the tractography algorithm found at least one streamline between the area A and B, then vertex a, representing area A was connected to vertex b, representing area B, by a graph edge. The three weights of \(\{a,b\}\) give the number of streamlines or fibers found between areas A and B, the average length of the streamlines, and the mean fractional anisotropy of the streamlines.

  2. 2.

    Step 1 was repeated 10 times for each subject. We accepted \(\{a,b\}\) to be an edge of the connectome of the subject, if it was present in all ten graphs computed in the repetitions. Next, for each edge we computed the maximum and the minimum number of the fibers, defining that edge, and deleted those two extremal values. Consequently, there remained 8 fiber numbers for each edge. We computed the mean value of those fiber numbers, the mean value of the lengths of the streamlines and the fractional anisotropies for the three weights of the edge.

In other words, the probabilistic tractography was performed 10 times, the graphs were constructed after each run, (i.e., 10 graphs were constructed for each subject), next the extremal fiber number values were deleted, the remaining 8 values were averaged, and the edges, which were present in all 10 graphs were allowed to be included in the resulting graph.

Steps 1 and 2 were performed only in the highest (i.e., the finest) resolution with 1015 vertices. For lower resolutions, the graphs were computed from the 1015-vertex graph by contracting vertices, summing the fiber numbers of the multiple edges between the two contracted vertices and contracting the multiple edges.

On the choice of 10 as the repetition number of the probabilistic tractography we refer to the detailed analysis in the “Discussion and results” section below.

From the dataset of the HCP website we were able to finish the graph computations for 1064 subjects.

The computation was done on our 24-member Intel i7 cluster (each with 6 physical and 12 virtual CPU cores and 16 GB of RAM) within 3 weeks running time.

Data records

The data source of this work was published at the Human Connectome Project’s website at http://www.humanconnectome.org/McNab et al. (2013) as the 1200 Subjects Public Release. The parcellation data, containing the anatomically labeled ROIs, is listed in the CMTK nypipe GitHub repository https://github.com/LTS5/cmp_nipype/blob/master/cmtklib/data/parcellation/lausanne2008/ParcellationLausanne2008.xls.

The braingraphs, computed by us, can be accessed at the https://braingraph.org/cms/download-pit-group-connectomes/ site, by selecting one of the download options, denoted by “X nodes set, 1064 brains, 1 000 000 streamlines, 10x repeated”, where \(X=86, 129, 234, 463, 1015\).

The graphs are given in GraphML format, described in https://cmtk.org Daducci et al. (2012). Each file begins with an attribute definition section, then the nodes are described with their coordinates and anatomical labels, corresponding to the parcellation at https://github.com/LTS5/cmp_nipype/blob/master/cmtklib/data/parcellation/lausanne2008/ParcellationLausanne2008.xls.

Next the (un-directed) edges are listed. The edges carry three weights:

  • The number of fibers;

  • The mean value of the fiber lengths in the edge;

  • And the mean fractional anisotropy of the fibers

Note that the edge weights are averages from the eight of the ten tractography-runs, therefore, even the fiber number is—typically —a non-integer.

Discussion and results

Here we describe the workflow, which implied the choice of the 10 repetitions of step 1 in the graph construction above. We note that the present section describes only the process, resulting the specific choice of the repetition number 10, and not the actual graph construction (which was already duly described in the “Methods” section).

The implementations of the deterministic tractography algorithms also contain a probabilistic seeding step; i.e., two runs of these tractography computations almost always yield different results. When we use probabilistic tractography Girard et al. Sep (2014); Buchanan et al. Feb (2014), it is evident that distinct runs yield different results.

For generating reproducible results in the graph construction with a probabilistic tractography phase, it is a natural idea to repeat the probabilistic tractography algorithm for the very same input several times, and to average the results of the tractography in a careful way.

Let us fix two vertices, and let the random variable X denote the number of fibers discovered between then, then, clearly, for any X: \(E(X-E(X))=E(X)-E(X)=0\), that is, the expectation of the difference of X from its expected value E(X) is 0. This fact implies that the repetitions and the averaging will increase the reliability of the tractography results.

For the determination of the number of repetitions k, with the trade-off with practical computability and robustness, we have followed the strategy, described as follows. In short, we determined the number of necessary repetitions by comparing deviations for 10 average values, each for k repetitions, for \(k=1,2,\ldots ,50\).

More exactly, we have chosen 9 subjects: for each non-zero leading digits of the ID numbers, one was chosen randomly (the choices were: 136631, 200008, 300618, 401422, 500222, 601127,700634, 800941, 901038). For a given subject, and a given positive integer value k, we have generated the following ten braingraphs:

$$\begin{aligned} {G_k}_1, {G_k}_2, \ldots {G_k}_{10}, \end{aligned}$$

where \({G_k}_i\) was calculated by k repetitions of the tractography phase, and averaging the numbers of fibers for each edge on the k runs.

For \(i=1,2,\ldots ,10\), we have generated independent k instances, and averaged these k fiber numbers for each edge. Next, we have thrown out those edges, which were not present in all the ten copies of the averaged graphs. Now, for each remaining edge \(\{u,v\}\) of the graph G, we computed the average fiber number values over k repetitions: one average value \(w^{(k)}_i(u,v)\) for each i in \({G_k}_i\), for \(i=1,2,\ldots ,10\). For readability, we omit (uv) from \(w^{(k)}_i(u,v)\) in what follows.

For these ten \(w^{(k)}_i\) values we computed the relative standard deviation (also called coefficient of variation) of the ten \(w^{(k)}_i\) values:

$$\begin{aligned} c_v(w^{(k)})={\sigma (w^{(k)})\over \mu (w^{(k)})}, \end{aligned}$$
(1)

where

$$\begin{aligned} \mu (w^{(k)})={ 1\over 10}\sum _{i=1}^{10}w^{(k)}_i, \ \ \sigma (w^{(k)})=\sqrt{{1\over 9}\sum _{i=1}^{10} (w^{(k)}_i-\mu (w^{(k)}))^2} \end{aligned}$$
(2)

Figure 1 displays the change of the relative standard deviation of the fiber number of a given edge (the edge, connecting vertex 19 and vertex 21 in the 463-vertex resolution in the case of subject No. 901038) for \(k=1,2,\ldots ,50\).

Fig. 1
figure 1

The change of the relative standard deviations (on the y axis) of the edge, connecting vertex 19 and vertex 21 in the 463-vertex resolution in the case of subject No. 901038, for \(k=1,2,\ldots ,50\), (on the x axis)

Figure 2 shows the change of the relative standard deviations, averaged for all edges as a function of k, in the case of a given braingraph, in 234-vertex resolution. Supporting Figures 1, 2, 3 and 4 show the same in graphs of different resolutions.

Fig. 2
figure 2

The change of the relative standard deviations (on the y axis), averaged for all edges as a function of \(k=1,2,\ldots ,50\) (on the x axis), in the case of the connectome of subject No. 300618, in 234-vertex resolution. The medians of the relative standard deviations are visualized by red horizontal lines, while the boxes show the middle-half of the datapoints: under the box there are the lower quarter-, above the box the upper quarter of the data points. The solid lines show the whole spread of the data points

Based on the visual examination of Figure 2 (and the related figures for other resolutions and subjects, cf. Supporting Figs. 1, 2, 3 and 4), we have chosen the \(k=10\) value for repetitions as a good trade-off between deviation and practical computability: for repetitions \(k>10\) the decrease of the red horizontal lines, showing the median relative standard deviations, is very small on Fig. 2 and Supporting Figs. 1 and 2, and still small on Supporting Figs. 3 and 4.