
1 Introduction

Diffusion magnetic resonance imaging (dMRI) is currently the method of choice for assessing the local microstructure in the white matter (WM) of the human brain in vivo. Tractography methods use this local microstructure to generate streamlines aiming at modeling the underlying anatomy of neural fibers in the brain. These streamlines are locally aligned with the estimated tissue orientation. The set of obtained streamlines is usually referred to as a tractogram and is the basis for different subsequent analyses.

Tractography has potential for clinically relevant applications. For instance, visualizations of tractography data are used by neurosurgeons with the goal of preserving important neural bundles during brain tumor resection [9, 16, 64]. Moreover, it is possible to group streamlines into bundles and perform analyses based on these. This so-called bundle-based or tract-based analysis (BBA/TBA) allows statistics to be computed for individual bundles, either averaged or along the streamlines. This can be used to perform group comparisons [8, 10, 60, 70] or to analyse WM changes in patients over time [5]. Focusing on the tractogram as a whole, the concept of structural connectivity is employed to analyse how pairs of cortical and subcortical regions are connected through streamlines of the tractogram [49, 59]. These connections can be summarized in graphs, which can be seen as estimates of the actual anatomical connections between gray matter (GM) regions through neural pathways in the WM. Such graphs serve as the basis for different group comparisons. For example, this approach has been applied to different neurological diseases, including epilepsy (e.g., [21]), multiple sclerosis (e.g., [31]) and Alzheimer’s disease (e.g., [17, 41]), as well as for assessing differences in the brain due to normal aging (e.g., [13]).

Despite its potential in the aforementioned applications, the validation of tractography is an open challenge in the field. Several recent studies have revealed that state-of-the-art methods to construct tractograms suffer both from false positives (FP), i.e., streamlines that are not related to anatomical structures [32], and from false negatives (FN), i.e., streamlines that are missing for an accurate description of known WM anatomy in its full extent [2]. At the same time, findings indicate that tractography results are to some extent reproducible [34]. That means that FPs are systematic rather than random and thus an intrinsic problem of current methods [32]. Already in 2014, Thomas et al. hypothesised that there are inherent limitations of tractography that lead to both FPs and FNs [62]. The importance of these two problems is application dependent. For example, it has been argued that for structural connectivity analysis, a high specificity is of higher importance than sensitivity [71]. This means that, for this specific application, reducing FPs is more important than reducing FNs.

Despite the efforts in recent years, new approaches have still not overcome the limitations of tractography methods [53]. In the particular case of FPs, an alternative to improving tractography methods is to remove the FPs from the tractogram in a post-processing step. In the following, we refer to this approach as tractogram filtering. Several methods from the literature can be used for this goal. Even though some methods were not designed for tractogram filtering, they can be adapted for this purpose. In this chapter, we consider tractogram filtering as a binary classification problem in which we aim at assigning either a positive (P) or a negative (N) label to each streamline. This allows us to assess whether a particular method can be used to define the labels P and N for, ideally, separating true positive (TP) and FP streamlines in a tractogram. In the following, we review the most relevant methods in the literature, point out their issues and pose key challenges on the way to improved tractogram filtering.

The chapter is organized as follows. Section 2 describes the main approaches and methods for tractogram filtering. Section 3 describes the key challenges of tractogram filtering and gives a perspective on future developments in the field. Finally, we make some concluding remarks in Sect. 4. In order to illustrate specific properties or issues of the different strategies, we run some experiments on a very limited amount of data from the Human Connectome Project (HCP).

2 Approaches for Tractogram Filtering

In the following sections, we review methods that can be used for tractogram filtering. Based on the definition of the labels P and N for each streamline, we identified four main criteria that these methods build upon, namely: explainability of the diffusion signal, inclusion and exclusion regions of interest (ROIs), streamline geometry or shape, and streamline similarity and clustering. Figure 1 shows the problem of tractogram filtering interpreted as a binary classification problem and lists the most relevant methods in this realm together with the main criteria used to group the streamlines.

Fig. 1 Tractogram filtering can be seen as classifying each streamline in a tractogram as either positive (P) or negative (N), targeting true positives (TPs) and false positives (FPs), respectively. Relevant methods use one or more of the depicted criteria for performing this classification. Some representative methods are also depicted in the figure. For a detailed list, refer to Table 1

Table 1 lists the most relevant tractogram filtering methods that are reviewed in the next subsections. The columns of this table describe the most relevant characteristics of these methods, namely: criteria, use of dMRI data, required context, main target and whether or not the method is data driven. First, criteria refers to the strategies from Fig. 1 that are followed when employing the method for streamline classification. Column dMRI indicates whether the method uses the dMRI data or not. As shown, only a few filtering methods make use of the acquired dMRI for performing the classification. The listed methods also differ in the required streamline context. While some are able to perform the classification individually per streamline, others also require the streamlines in the bundles of interest or the complete tractogram as extra inputs (column context). Furthermore, by design, the target of some methods is to define either the positive (i.e., P) or negative (i.e., N) label but not both. In other words, some methods aim at being more specific in detecting TPs than FPs, and the other way round for others. As an example, a streamline classified as negative by Recobundles can still be a TP, since it can belong to a bundle not present in the atlas. The target of the method is important to consider, since preferring a higher specificity of P or N is application dependent. This is shown in column target. Finally, the methods can also be grouped into rule-based and data-driven ones. While the former make use of classical approaches, the latter use machine learning techniques for performing the classification.

In the following sections, we describe the aforementioned criteria in detail and list the most representative methods that use those criteria to perform the classification of streamlines.

2.1 Explainability of the Diffusion Signal

The idea behind this approach is that high-quality tractograms can be used to explain the acquired dMRI data. In other words, synthetic dMRI generated from tractograms should be very similar to the acquired dMRI data. Thus, these methods focus on finding a subset of streamlines which can generate data that approximates the measured signal as closely as possible. Streamlines not belonging to such a subset are likely implausible (or duplicates of other streamlines already contributing to the signal) and might be removed. This approach is shown in Fig. 2.

Table 1 Representative methods for tractogram filtering. Criteria shows the criteria used for streamline classification (cf. Fig. 1). Column dMRI indicates if the method uses diffusion data. Context describes the contextual information required for streamline classification. Target shows the target labels where the method is more specific. Data driven specifies if the method makes use of machine learning or not

Fig. 2 dMRI signal fitting approach. Synthetic dMRI is computed based on a microstructure model and compared to the acquired dMRI. The streamlines that are not relevant to minimizing the residual might be filtered out

As shown in Table 1, these methods require the whole tractogram as an input. In the following subsections, we will describe the most commonly used methods for dMRI signal fitting and discuss their issues.

2.1.1 LiFE and COMMIT

Linear Fascicle Evaluation (LiFE) [7, 42] and Convex Optimization Modeling for Microstructure Informed Tractography (COMMIT) [12] state the problem as follows: let \(\mathbf {y}\) be a vector with the acquired diffusion signals, \(\mathbf {A(T)}\) be a forward operator for synthesizing diffusion data from the streamlines of the tractogram \(\mathbf {T}\), \(\mathbf {x}\) be a vector with weights for the contribution of every streamline to the acquired data, and \(\eta \) be the acquisition noise. Then \(\mathbf {y}\) can be written as:

$$\begin{aligned} \mathbf {y} =\mathbf {A(T)}\, \mathbf {x}+\eta . \end{aligned}$$
(1)

Since the weights \(\mathbf {x}\) cannot be negative, it is possible to solve (1) through non-negative least squares:

$$\begin{aligned} \displaystyle {\mathop {\mathrm {arg\,min}}\limits _{\mathbf {x} \ge 0}} ||\mathbf {A(T)\,x-y}||_2^2. \end{aligned}$$
(2)

Filtering is performed by discarding streamlines with low weights. This formulation allows for the use of different models to couple the information from streamlines to the measured signal or derivatives thereof. First, \(\mathbf {A}\) can be chosen from a large variety of forward operators proposed in the literature [39]. As an example, in the original papers, COMMIT was based on a multi-compartment forward model while LiFE used the stick and ball model. Nevertheless, both can be adapted to use any other model. Second, different solvers can be applied to the non-negative least-squares problem of (2). Both COMMIT and LiFE use the subspace Barzilai-Borwein (SBB) solver proposed in [28]. Due to the nature of the problem, sparsity of the weights is desirable. For this purpose, COMMIT proposes a basis pursuit denoising (BPDN) formulation of (2) that actively promotes sparsity by minimizing the \(\ell ^1\)-norm of \(\mathbf {x}\). Such a formulation can be written as:

$$\begin{aligned} \displaystyle {\mathop {\mathrm {arg min}}\limits _{\mathbf {x} \ge 0}} ||\mathbf {x}||_1, \text { subject to } ||\mathbf {A(T)\,x-y}||_2^2 \le \epsilon , \end{aligned}$$
(3)

where \(\epsilon \) is a parameter that bounds the admissible squared residual.
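As a side remark that follows directly from the non-negativity constraint (it is not taken from the COMMIT papers), the \(\ell ^1\)-norm in (3) reduces to the plain sum of the weights:

$$\begin{aligned} ||\mathbf {x}||_1 = \sum _i |x_i| = \sum _i x_i = \mathbf {1}^{\top }\mathbf {x} \quad \text {for } \mathbf {x} \ge 0, \end{aligned}$$

so that (3) amounts to minimizing a linear objective under a quadratic constraint on the residual.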

In order to reduce the inherent computational burden of these strategies, \(\mathbf {A}\) is implemented in LiFE and COMMIT through a lookup table on a dictionary of precomputed estimations. Moreover, a GPU-based optimized version has recently been proposed for LiFE [29].
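The non-negative least-squares problem (2) and the subsequent weight-based filtering can be illustrated on a toy example. The following is only a minimal sketch: the matrix `A`, the threshold `1e-3` and the function name are hypothetical, and a plain projected gradient descent is used as a simple stand-in for the solver employed by LiFE and COMMIT (an off-the-shelf routine such as `scipy.optimize.nnls` could be used instead).

```python
import numpy as np


def nnls_projected_gradient(A, y, n_iter=2000):
    """Minimise 0.5 * ||A x - y||_2^2 subject to x >= 0.

    The factor 0.5 does not change the minimiser of problem (2).
    """
    # Step size 1/L, with L the largest eigenvalue of A^T A.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)
        x = np.maximum(x - step * grad, 0.0)  # projection onto x >= 0
    return x


# Hypothetical forward operator: 4 measurements (rows), 3 streamlines (columns).
A = np.array([[1.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0]])
x_true = np.array([2.0, 1.0, 0.0])  # the third streamline does not contribute
y = A @ x_true                      # noise-free synthetic signal

weights = nnls_projected_gradient(A, y)
keep = weights > 1e-3               # discard streamlines with (near-)zero weight
```

In this noise-free toy problem, the third streamline receives a weight close to zero and is the only one discarded. Real forward operators are far larger and are not stored as dense matrices.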

2.1.2 SIFT and SIFT2

Spherical-deconvolution Informed Filtering of Tractograms (SIFT) [55] is very similar to the techniques from the previous subsection. Instead of targeting the raw dMRI data, it aims at reconstructing the fiber orientation distribution function (fODF) in each voxel. First, the fODF is obtained with constrained spherical deconvolution (CSD) [63]. Second, the contribution of every streamline to the fODF is assessed. These contributions are used to determine whether a streamline is deemed redundant/noisy or not. Third, these contributions are sorted in order to remove the least relevant streamlines. Finally, the second and third steps are iterated until either a target number of streamlines or a certain residual level is reached. Unlike LiFE and COMMIT, SIFT does not generate weights per streamline. Thus, SIFT2 was proposed as a slight modification of SIFT in which an additional regularization term is added and a weight per streamline is computed [56].
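The iterative removal described above can be mimicked with a strongly simplified greedy loop. This sketch is not the SIFT algorithm as implemented in MRtrix3: it assumes a hypothetical matrix `C` holding the contribution of each streamline to each voxel, a hypothetical per-voxel `target` density standing in for the fODF, and it ignores the proportionality coefficient and the per-lobe bookkeeping of the actual method.

```python
import numpy as np


def greedy_filter(C, target, min_keep=1):
    """Drop streamlines one by one while this lowers the squared residual."""
    keep = np.ones(C.shape[0], dtype=bool)
    density = C.sum(axis=0)
    cost = np.sum((density - target) ** 2)
    while keep.sum() > min_keep:
        idx = np.flatnonzero(keep)
        # Residual obtained after removing each remaining streamline in turn.
        trial = np.sum((density - C[idx] - target) ** 2, axis=1)
        best = np.argmin(trial)
        if trial[best] >= cost:
            break  # no single removal improves the fit any further
        keep[idx[best]] = False
        density = density - C[idx[best]]
        cost = trial[best]
    return keep


# Toy data: 4 streamlines (rows) crossing 3 voxels (columns).
C = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],   # duplicate of the first streamline
              [0.0, 0.0, 1.0]])
target = np.array([2.0, 1.0, 1.0])

keep = greedy_filter(C, target)
```

Here one of the two duplicated streamlines and the last streamline are removed, after which the remaining density matches the target exactly. The stopping rule (no further improvement) corresponds to the residual-based criterion, while `min_keep` plays the role of the target number of streamlines.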

In order to compare the agreement between SIFT and SIFT2, we ran them with their standard parameters on a whole-brain tractogram of one million streamlines computed for one HCP subject with anatomically-constrained tractography (ACT) [54] from MRtrix3. In this experiment, SIFT selected 34.6% of the streamlines in the original tractogram. Figure 3 shows the histograms of weights computed with SIFT2, where the individual histograms are obtained by separating the streamlines based on whether they were accepted or discarded by SIFT. It can be seen that the two histograms overlap substantially. A repetition of the experiment with 500k streamlines showed similar results. This means that it is difficult to reproduce the results from SIFT with the weights computed with SIFT2. Thus, while SIFT2 can be useful for describing the contributions of streamlines to the acquired data, unlike SIFT, its direct use for tractogram filtering is not straightforward.

Fig. 3 Histograms showing the frequency of different SIFT2 weights computed for streamlines that are filtered out (in blue) or not (in orange) with SIFT. Both methods were computed on an HCP subject with 1 million streamlines computed with ACT in MRtrix3

2.1.3 Issues of dMRI Signal Fitting

While dMRI signal fitting is appealing, it has not been able to significantly reduce the false positive rate [50, 52]. This can be attributed to various reasons. Daducci et al. [11] discuss a number of issues of dMRI signal fitting. First, most current dMRI signal fitting methods require a whole-brain tractogram, even if only a single white matter bundle is of interest. This results both in an unnecessary computational burden and in a higher risk of false positives when targeting specific fiber bundles. For example, if a fiber bundle is not appropriately represented in the full tractogram, which is not uncommon, there is a higher risk that implausible streamlines from other bundles take its place in the reconstruction. Second, the computed weights of streamlines tend to be inversely proportional to the number of similar streamlines in the tractogram. This effect is shown in Fig. 4 for the case of SIFT2 (a similar behavior is expected from LiFE and COMMIT). As shown, the SIFT2 weights are lower along the centerlines of the bundles, where tractography tends to yield more streamlines. Thus, thresholding of SIFT2 weights cannot be used for progressive filtering, since that might result in discarding the most important tracts very early. Moreover, some noisy streamlines might be classified as acceptable just because of the reward they get for reaching distant regions. This issue comes from the fact that the weights of SIFT2 are designed for fitting the acquired data, but not for filtering. This could potentially be solved through an extra step of weight normalization that, to our knowledge, has not been proposed so far. Third, as discussed in [11], minimizing the residual from Fig. 2 does not guarantee that the solution is plausible, as current methods are very prone to overfitting due to the large number of unknowns that must be estimated. Finally, working with incomplete streamlines can lead to biased results in the uncovered regions, which might be especially problematic for structural connectivity.

Fig. 4 Average weights computed through SIFT2 per voxel. Darker and brighter voxels correspond to lower and higher SIFT2 weights, respectively. The values are extracted from the same dataset as in Fig. 3

In addition to these issues, most dMRI signal fitting methods compute a single weight per streamline, which can lead to acceptance of implausible streamlines that are erroneous in a small region (cf. Fig. 5). As suggested in [42], this issue can be handled by having variable weights along the streamlines. However, this solution might come at a cost of numerical instability. Finally, an important aspect to consider is the applicability of these methods to diseased brains. For example, it was reported in [72] that using SIFT in certain types of illnesses (e.g. brain tumors) can lead to wrong conclusions of connectivity changes.

Fig. 5 The course of two close fiber bundles is depicted in dashed green. A local error (blue box) can lead to a noisy streamline (in red). Since most segments of its path are correct, it still might be classified as acceptable by most dMRI signal fitting methods

In [69], we tested the performance of SIFT2 on short synthetic streamlines. First, we generated dMRI data with Fiberfox [35], with and without noise, from a set of straight and bent short streamlines. Second, we added a set of streamlines at a different angle. These added streamlines should ideally be classified as noisy as they do not comply with the generated dMRI data. Finally, SIFT2 was run in order to assess its ability to separate the ‘signal-generating’ streamlines from the added ones. The main result of these experiments is that SIFT2 is able to filter the implausible streamlines in this simplified problem only for straight streamlines with b-values below 3000 s mm\(^{-2}\) and low levels of noise. Our hypothesis is that dMRI signal fitting methods might benefit from longer streamlines for increasing their stability. However, as already discussed, stability cannot guarantee a good solution, as shown in Fig. 5.

It is worth noting that the filtered tractograms have been reported to be significantly different from the non-filtered ones [45]. However, it is difficult to assess whether the differences are due to the reduction of noise or to the aforementioned inherent problems of this methodology.

Finally, it is important to remark that global tractography [33, 46] is closely related to dMRI signal fitting. Both use the acquired diffusion data to assess the quality of the tractograms, one for generating streamlines, and the other to assess the plausibility of them. Thus, global tractography shares the same issues of dMRI signal fitting. On the other hand, the current dMRI signal fitting methods are more efficient than global tractography.

2.2 Inclusion and Exclusion ROIs

A different approach is to implement anatomical constraints by using segmentation masks for filtering out implausible streamlines.

An intuitive way to employ such masks is atlas-based tractogram filtering. This strategy comprises two steps: first, the atlas of streamline bundles is registered to the subject data, and second, streamlines that are not fully contained within a single bundle mask after registration are filtered out. Despite its simplicity, this method has various shortcomings. First, compared to gray matter (GM), registration of WM in raw data is more challenging due to the relatively low contrast and the less convoluted structure [58]. This makes registration more prone to errors [22]. Thus, different methods have been proposed for the specific purpose of WM registration on both raw and derived features such as fractional anisotropy (FA) maps (cf. [38] for a review of methods). Second, atlas-based approaches are known for not being able to model the anatomical variability among subjects very well. Finally, the anatomical shape of fiber bundles can change due to illnesses (e.g. brain tumors), making the use of atlases difficult in such applications.
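The second step of the atlas-based strategy, i.e., the containment test, can be sketched as follows. The sketch assumes that registration has already been performed, that streamline coordinates are given in voxel space with voxel corners at integer positions, and that each bundle mask is a boolean 3D array; the function names are hypothetical. Only the sampled points are tested, not the segments between them.

```python
import numpy as np


def inside_mask(streamline, mask):
    """True if every point of the streamline lies in a voxel where mask is True."""
    vox = np.floor(streamline).astype(int)
    in_bounds = np.all((vox >= 0) & (vox < mask.shape), axis=1)
    if not in_bounds.all():
        return False
    return bool(mask[vox[:, 0], vox[:, 1], vox[:, 2]].all())


def accepted_by_atlas(streamline, bundle_masks):
    """Keep a streamline only if a single bundle mask fully contains it."""
    return any(inside_mask(streamline, m) for m in bundle_masks)


# Toy bundle mask: a straight tube of three voxels along the first axis.
mask = np.zeros((5, 5, 5), dtype=bool)
mask[1:4, 2, 2] = True

s_in = np.array([[1.2, 2.5, 2.5], [2.5, 2.5, 2.5], [3.8, 2.5, 2.5]])
s_out = np.array([[1.2, 2.5, 2.5], [2.5, 3.5, 2.5]])  # leaves the tube

labels = [accepted_by_atlas(s, [mask]) for s in (s_in, s_out)]
```

The first streamline stays inside the tube and is accepted, whereas the second one leaves it and is rejected.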

An interesting alternative to atlas-based filtering is to estimate segmentation masks of fiber bundles directly from dMRI data. By that, the problematic registration step is avoided. TractSeg [67, 68] follows this idea by training a neural network for segmenting 72 different fiber bundles. In this approach, fODF peaks are taken as input to a 2.5D U-Net architecture for segmentation [48]. Unlike atlas-based methods, TractSeg does not require registration and is subject-specific since it only uses the acquired data.

Alternatively, Wassermann et al. [66] proposed the white matter query language (WMQL) and its implementation TractQuerier, which can be used to define fiber bundles of interest using high-level relationships between GM and WM structures. The method uses any parcellation for locating regions of GM that can be used in the query. WMQL can be used for filtering out streamlines not compliant with any of the anatomical rules defined for extracting all bundles of interest. Notice that although neither TractSeg nor TractQuerier were proposed for tractogram filtering, they can be easily adapted for this purpose. Moreover, it is important to remark that TractQuerier itself is only a tool to define the mentioned rules. Thus, it can potentially be used for tractogram filtering either targeting TPs or FPs.

It is important to note that methods using segmentation masks as inclusion or exclusion ROIs may fail to reject certain implausible streamlines. That is, compliance with inclusion/exclusion ROIs is often only a necessary but not a sufficient condition for the plausibility of a streamline. Moreover, while some tractography methods do include anatomical priors in the formulation (e.g. [26, 54]), in practice, the resulting tractograms usually still contain false positives.

2.3 Streamline Geometry or Shape

In this approach, the shape of streamlines is used for deciding whether or not they are plausible. Most tractography methods already include restrictions on the shape of streamlines, such as maximum local curvature or minimum and maximum length. Moreover, some tractography methods include geometry priors at a higher level [4, 15]. Since these rules are usually included in state-of-the-art tractography approaches, there has been less need for filtering methods implementing this idea.
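Such shape restrictions are simple to evaluate per streamline. The sketch below computes the length and the largest turning angle between consecutive segments, which serves as a proxy for the maximum local curvature when the step size is fixed. The threshold values are hypothetical placeholders, and consecutive points are assumed to be distinct.

```python
import numpy as np


def length_and_max_angle(streamline):
    """Return the polyline length and the largest turning angle in degrees."""
    seg = np.diff(streamline, axis=0)
    seg_len = np.linalg.norm(seg, axis=1)
    unit = seg / seg_len[:, None]
    cos = np.clip(np.sum(unit[:-1] * unit[1:], axis=1), -1.0, 1.0)
    max_angle = float(np.degrees(np.arccos(cos)).max()) if len(cos) else 0.0
    return float(seg_len.sum()), max_angle


def is_plausible(streamline, min_len=20.0, max_len=250.0, max_angle_deg=45.0):
    """Rule-based check; the default thresholds are illustrative only."""
    length, max_angle = length_and_max_angle(streamline)
    return bool(min_len <= length <= max_len and max_angle <= max_angle_deg)


straight = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                     [2.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
bent = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])

straight_stats = length_and_max_angle(straight)
bent_stats = length_and_max_angle(bent)
```

With `min_len=1.0`, the straight streamline passes the check, while the bent one is rejected because of its 90 degree turn.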

A recurrent issue in probabilistic tractography is the existence of streamlines with unrealistic loops. In [3], changes in track density are used for detecting and removing such loops. Using a more general formulation, it is assumed in [65] that the relationship between a streamline and its neighbours should be similar along the path. Such relationships are modeled through graphs and are analyzed using spectral graph theory. This method is able to remove unrealistic loops while keeping the ones that comply with anatomy.

Recently, the usefulness of streamline geometry for filtering has been shown. For example, in ExTractor simple geometry priors are employed to assess streamline plausibility [43]. Surprisingly, the authors found that around half of the streamlines generated by many state-of-the-art methods do not comply with such priors. A graph-based neural network was trained in [1] to reduce the computational cost of ExTractor.

Additional geometry priors can potentially be useful for tractogram filtering. Geometrical features extracted from a population could be used to assess the plausibility of streamlines at a local level. For example, an atlas of local sheet probability index [61], statistical shape models of fiber bundles [15] or an atlas of local orientation and curvature [6] could be beneficial for removing false positives.

Deep learning-based methods have also been proposed for tractogram filtering, most of them relying mainly on streamline geometry as the criterion. For instance, in FiberNet a convolutional neural network is used on the coordinates of the streamlines after a spatial normalization for bundle classification [24]. In [73], the authors extract a set of features from the shape of streamlines (named FiberMap) to train their bundle classifier. Streamlines not belonging to a known bundle are assigned to an additional class. Furthermore, TRAFIC trains a network using distances to five landmarks, curvature and torsion per tract as features for filtering [37]. Moreover, DeepBundle [30] uses a graph convolutional neural network for extracting geometric features from the streamlines. Such learned features are then used to assign the streamlines to their most likely fiber bundle in an end-to-end fashion. The loss function can be designed to target false positives. Since these deep learning-based methods use streamlines of specific fiber bundles for training, their main target is the TPs in those bundles.

Similarly to inclusion and exclusion ROIs, geometry constraints on the streamlines are not necessarily sufficient criteria for deciding the validity of streamlines and must be combined with other priors. This does not, however, apply to the mentioned deep learning approaches, which, depending on the training labels, are able to learn a model of plausible streamline geometry.

2.4 Streamline Similarity and Clustering

With this strategy, streamlines are clustered in bundles before further analysis. Such clusters can be used as surrogates of the underlying structure of WM. Fibers belonging to small clusters or that do not share similar properties of bundles of interest can be removed.

The only requirement for using standard clustering algorithms for streamline clustering is to define a distance metric between streamlines. While proposing distance metrics is straightforward, it is more difficult to find the most appropriate one for streamlines. Depending on the application, a tractogram can consist of millions of streamlines. Thus, it is critical to use efficient implementations. For example, QuickBundles [18] was proposed as a tool for performing clustering very efficiently.
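As an illustration of such a distance metric, the following sketch implements the minimum average direct-flip (MDF) distance, which is the metric commonly associated with QuickBundles. It is a plain NumPy re-implementation for illustration and not the code of any particular library; the number of resampling points is an arbitrary choice.

```python
import numpy as np


def resample(streamline, n_points=12):
    """Resample a polyline to n_points equally spaced along its arc length."""
    seg_len = np.linalg.norm(np.diff(streamline, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg_len)])
    new_arc = np.linspace(0.0, arc[-1], n_points)
    return np.column_stack([np.interp(new_arc, arc, streamline[:, k])
                            for k in range(streamline.shape[1])])


def mdf(a, b):
    """MDF distance between two streamlines with the same number of points."""
    direct = np.mean(np.linalg.norm(a - b, axis=1))
    flipped = np.mean(np.linalg.norm(a - b[::-1], axis=1))
    return float(min(direct, flipped))


# Two parallel straight streamlines, 1 unit apart, stored in opposite order.
s1 = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
s2 = np.array([[10.0, 1.0, 0.0], [0.0, 1.0, 0.0]])

distance = mdf(resample(s1), resample(s2))
```

Taking the minimum over the direct and the flipped ordering makes the distance independent of the direction in which a streamline happens to be stored, so the two streamlines above are 1 unit apart.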

Once clusters of streamlines are extracted, there are different alternatives for performing filtering. For example, Recobundles uses an atlas-based strategy in which the clusters are first registered to an atlas of streamline bundles, followed by a pruning procedure of streamlines lying far away from the registered centroids [19]. Following another strategy, in [74], 800 clusters of streamlines are computed for a number of subjects that, after manual curation performed by an expert, are used for creating an atlas of streamline clusters. This curated white matter atlas (WMA) is used for filtering out streamlines far away from any cluster in the atlas. Following a different idea, BundleMAP [27] uses support vector machines on the mean and covariance of the coordinates of the streamlines in a bundle to detect FPs.

Clustering methods based on deep learning have the potential to be computationally more efficient than classical approaches. In [40], the authors proposed FS2NET, a Siamese deep neural network that uses bi-directional long short term memory (LSTM) layers for learning a distance measure between streamlines. With this distance, the method can be used to assess if two streamlines should be clustered together or not.

Implausible streamlines tend to follow more erratic paths compared to plausible ones. Thus, using clustering is appealing for filtering, since it introduces strong requirements on the smoothness of streamlines. Moreover, combined with atlas-based approaches, clustering methods are able to filter both whole-brain and partial tractograms. Unlike the atlas-based approaches described in Sect. 2.2, the registration is performed on tractograms, which tends to be more accurate (cf. [38] for a review of methods). Still, the inherent issues of atlas-based approaches might have an impact on the accuracy of such methods.

Since the methods of this subsection are based on bundle similarity, they target only certain bundles and, by that, only TPs.

2.5 Multiapproaches

From the previous discussion, it is natural to devise methods that take advantage of different priors for increasing accuracy. Because the research field is relatively new, only a few multiapproach methods have been proposed. In this line, COMMIT2 [50] adds anatomy priors to the original formulation of COMMIT in order to target the issues of dMRI signal fitting methods. Another example is anchor-constrained plausibility [36], which combines streamline clustering and dMRI signal fitting for performing filtering. In [23], FiberNet2.0 has been proposed as an extension of FiberNet in which inclusion/exclusion ROIs are added to the processing pipeline.

We have recently used deep learning for combining two methods: RecoBundles and ExTractor [25] using the dMRI signal as the only input of the neural network. From our preliminary results, it is not obvious which method should be used as gold standard, as the choice of accuracy measurement depends very much on the application. Thus, while a perfect combination of priors is not straightforward, from our experience in [25], we expect new methods that combine two or more priors to perform better on average.

3 Challenges and Perspective

Machine learning, and in particular deep learning, has been very successful in many medical image analysis applications in the last few years. Preliminary efforts show the potential of this approach also for the specific problem of tractogram filtering [1, 25]. However, important challenges for methods following this approach remain open.

Fig. 6 Percentage of accepted streamlines obtained with TractQuerier (TQ), Recobundles (RB), TractSeg (TS) and COMMIT (CM) on a 10M tractogram of one HCP subject. Left: Thresholding the number of positive labels from the four methods per streamline (majority voting). Right: Individual methods

As mentioned in [44], there are important general challenges for tractography, which also apply to tractogram filtering. Specifically, machine learning and deep learning require large amounts of good-quality training data, which are difficult to obtain for tractogram filtering. Moreover, the available training data are relatively scarce and difficult to combine. Also, inter-observer variability of manual annotations is particularly severe in tractography [47, 52]. Furthermore, it is in general questionable whether manual annotation of whole-brain tractograms will ever become a feasible goal. For this reason, the definition of adequate training labels in the absence of a ground truth or a strong gold standard can be expected to remain the main challenge for machine learning-based methods in this field in the future.

One way of addressing this issue is to combine different methods as automatic annotation tools in order to define a gold standard. Following this idea, in [25], we proposed a method that builds on top of two methods, namely Recobundles and ExTractor, for defining labels. However, this combination is not straightforward. As pointed out in Table 1 and in the previous sections, it must be considered that different methods assess different characteristics of the tractogram. For example, the rejection of a streamline based on geometry priors could have a higher confidence than basing that decision on a clustering argument. The reverse is also true: a streamline close to an anatomically plausible cluster might be accepted, but a streamline compliant with a finite number of geometry-based rules could still violate other unchecked rules and therefore be implausible. Filtering also depends on the application. If the goal is to obtain segmentation masks, geometrical constraints could be of lower value.

In order to investigate the process of finding a good balance in the combination of different automatic annotation tools, we ran TractQuerier (TQ), which is an implementation of WMQL, Recobundles (RB), TractSeg (TS) and COMMIT (CM) on a tractogram of 10 million streamlines computed with ACT [54] for one HCP subject. A naïve approach to combine the methods would be majority voting. Figure 6 (on the left) shows the acceptance rates of streamlines for the testing dataset after performing a majority voting with different thresholds. As shown, requiring at least three methods to accept a streamline would result in a massive filtering of 95.1% of the dataset. Even with a milder threshold of at least two methods, 79.1% of the dataset would be filtered out. While the amount of around 20% of accepted streamlines (i.e., two million streamlines in this example) could be enough to fill up the space of the WM with an improved TP-to-FP ratio, potential biases of such an ad hoc approach would need to be investigated. Also, the fact that all four methods agree on only 50.6% of the streamlines (0.2% with four positive votes and 50.4% with four negative votes) indicates that a more sophisticated strategy for the combination of different tools should be considered. This could maintain a higher acceptance rate of TP streamlines and thereby potentially reduce the required number of streamlines in the tractogram.
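The voting scheme itself is straightforward to express. The sketch below uses a small, made-up label matrix (it does not reproduce the HCP results reported above) with one row per streamline and one column per method, and computes the acceptance rate for each vote threshold as well as the fraction of streamlines on which all methods agree.

```python
import numpy as np

METHODS = ["TQ", "RB", "TS", "CM"]

# Hypothetical labels: True = accepted (P), False = rejected (N).
votes = np.array([
    [1, 1, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
], dtype=bool)

# Number of methods that accept each streamline.
n_positive = votes.sum(axis=1)

# Fraction of streamlines accepted when at least k methods must agree.
acceptance = {k: float(np.mean(n_positive >= k))
              for k in range(1, len(METHODS) + 1)}

# Fraction of streamlines with a unanimous decision (all P or all N).
unanimous = float(np.mean((n_positive == 0) | (n_positive == len(METHODS))))
```

Replacing the unweighted count in `n_positive` by a weighted sum would be one simple way to express different levels of confidence in the individual methods.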

Figures 6 (on the right) and 7 show the percentage of streamlines accepted by each method as well as their agreement, respectively. As shown, the most restrictive method is CM and the most relaxed one is TQ. While the other two methods are in the middle, they are also rather restrictive. Moreover, any pair of methods agrees on 60–80% of the streamlines. From these initial analyses, it is clear that more research is needed in order to find better ways of synthesising information from different methods for the purpose of tractogram filtering than simple majority voting. Again, the final application must also be considered for assessing the ideal approach.
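A pairwise agreement matrix of the kind described above can be computed from the same type of label matrix. The following is a minimal sketch with made-up labels, not the data behind the reported figures; the function name is hypothetical.

```python
import numpy as np


def pairwise_agreement(votes):
    """Fraction of streamlines on which each pair of methods gives the same label."""
    v = np.asarray(votes, dtype=float)
    both_positive = v.T @ v
    both_negative = (1.0 - v).T @ (1.0 - v)
    return (both_positive + both_negative) / v.shape[0]


# Hypothetical labels: rows are streamlines, columns are TQ, RB, TS, CM.
votes = np.array([
    [1, 1, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
], dtype=bool)

agreement = pairwise_agreement(votes)
```

The resulting matrix is symmetric with ones on the diagonal. High agreement alone does not imply correctness, since two methods can agree on a wrong label.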

Fig. 7 Agreement between TractQuerier (TQ), Recobundles (RB), TractSeg (TS) and COMMIT (CM) on a 10M tractogram of one HCP subject. Every entry shows the percentage of streamlines with the same classification label obtained by the corresponding pair of methods

In addition to combining different methods, using other prior information can be potentially useful for tractogram filtering. For example, including specific microstructure information has been useful for tractography [20, 51], which can also be expected from tractogram filtering methods. Combining dMRI and functional information is promising to understand the mechanisms for brain connectivity [14]. Thus, it would be interesting to explore in the future the use of functional data for tractogram filtering.

4 Conclusion

Tractogram filtering is a relatively young but very active research area with a high potential for development. While the methods reviewed in this chapter show that there is a multitude of ways to obtain information related to the plausibility of streamlines, there is not yet a holistic approach that separates the TP and FP streamlines in a tractogram in a fully satisfying way. In our opinion, machine learning-based methods have the potential to contribute substantially to tractogram filtering. However, at this moment the applicability of supervised approaches is tightly coupled to the proper definition of training labels, which are difficult to obtain in the absence of a ground truth. We see the combination of different automatic annotation tools, potentially complemented with manual annotations from neuroanatomists, as a promising avenue to address this problem, although further developments are still needed in that line of research.