Fast Streamline Search: An Exact Technique for Diffusion MRI Tractography

In this work, a hierarchical search algorithm is proposed to efficiently compute the distance between similar tractography streamlines. This hierarchical framework offers an upper bound and a lower bound for the point-wise distance between two streamlines, which guarantees the validity of a proximity search. The proposed streamline representation enables the use of space-partitioning search trees to increase the tractography clustering speed without reducing its accuracy. The resulting approach enables a fast reconstruction a sparse distance matrix between two sets of streamlines, for all similar streamlines within a given radius. Alongside a white matter atlas, this fast streamline search can be used for accurate and reproducible tractogram clustering.


Introduction
In classical anatomy, the study of white matter fascicles and bundles connecting different brain regions required dissection. The non-invasive analysis of these connections has been greatly facilitated by the use of diffusion weighted MRI (Catani et al., 2002;Jones, 2008). From diffusion weighted MRI, tractography algorithms can be employed to investigate the white matter structure and connectivity (Wakana et al., 2007;Descoteaux, 2015;Jbabdi & Johansen-Berg, 2011). Representing white matter pathways, these tractography streamlines are often grouped in bundles for further analysis, such as tractometry (Bells et al., 2011;Chamberland et al., 2019a;Chamberland et al., 2019b).
Streamlines reconstructed from a tractography algorithm are composed of an ordered list of points, depicting local white matter position and trajectory. Each streamline is a polygonal chain, a set of connected line segments (also named polyline in computer graphics), with some specific characteristics that depend on the tractography algorithm. For example, most tractography algorithms reconstruct streamlines with a fixed step size (segment length) and a maximum turning angle (Tournier et al., 2012;Côté et al., 2013;Behrens et al., 2014).
Multiple tractography applications require a grouping of similar streamlines for analysis. These streamlines can be clustered based on shape similarity and proximity. Numerous algorithms and definitions of distance have been studied to improve the accuracy and efficiency of streamlines clustering (Guevara et al., 2011;Siless et al., 2013;Garyfallidis et al., 2012;Garyfallidis et al., 2016;Olivetti et al., 2017;Vázquez et al., 2020). Searching for the nearest streamline in a pre-segmented set of streamlines (called a bundle atlas) can be used to automatically dissect a tractogram into different bundles and white matter pathways (O'Donnell & Westin, 2007;Garyfallidis et al., 2018;Wang & Shi, 2019;Bertò et al., 2021). However, current approaches rely on space embedding techniques or subsampling without any distance-preserving guaranties, resulting in approximate distance.
In parallel, similar proximity search algorithms have been proposed in the data-mining field to analyze and compare time series data (Liao, 2005;Fu, 2011;Wang et al., 2013;Kotsakos et al., 2013). A multivariate time series could also be represented as a polygonal chain. Nonetheless, current distance measures for tractography streamlines do not directly fit in this framework. Interestingly, some bounded dimensionality reduction techniques employed for time series can be adapted to an existing streamline distance measure (Yi & Faloutsos, 2000;Keogh et al., 2001;Chan et al., 2003;Wang et al., 2013). One of these techniques, the piecewise aggregate approximation (Keogh et al., 2001), can be adapted to estimate Euclidean-based streamline distance, offering a lower bound which guarantees no false dismissal. In other words, this approximation never overestimates the given distance; thus, it never wrongly rejects streamlines in a radius search.
In this work, we focus on a streamline representation/ simplification that conserves important distance properties. The resulting hierarchical representation enables the use of standard binary search trees to increase the clustering speed. In addition, the theoretical upper and lower bounds are used to ensure the accuracy of the proximity search. The resulting formulation can be applied to efficiently compute an exact nearest neighbor (NN) or k-nearest neighbors (KNN) search within a maximum distance.

Methods
This section describes the proposed hierarchical approach to efficiently evaluate streamlines distance. For this, we first detail the mathematical framework to compute the distance between two streamlines. Followed by interesting mathematical properties used to optimize streamline representation and construct the proposed hierarchical search. Afterward, this approach is employed to search for the nearest streamline in a pre-segmented white matter bundle atlas, described in the "Experiments" section.

Distance Between Two Streamlines
A tractography streamline S = [ 1 , ..., m ] is defined as an ordered series of m points, where each of those points lives in a three-dimensional space i ∈ ℝ 3 , i ∈ {1, ..., m} . The distance between two points in n-dimensions ( , ∈ ℝ n ) is generally defined by the Minkowski distance ( L p -norm).
This distance is a generalization of both the Manhattan ( L 1 ) and the Euclidean ( L 2 ) distance. It satisfies the triangle inequality for any p ≥ 1 , resulting in a valid metric. This can be extended to define the maximum norm ( L ∞ ) as p → ∞ , dual to the L 1 norm in finite-dimensional spaces. This research focuses on the L 1 and the L 2 norms. Nonetheless the L ∞ provides some bounding capacity and is sometimes used in binary search trees.
The proposed method utilizes the sum (or average) of L p -norm to compute the distance between two streamlines The "mdist L p (⋅, ⋅) " is employed to compute the average point-wise distance. This is done to normalize the distance by the number of points. When computed for both ascending ( W = [ 1 , ..., m ] ) and descending ordered points ( W � = [ m , ..., 1 ] ), the average L 2 distance is equivalent to the minimum-average direct flip (MDF) proposed by Garyfallidis et al. (2012). The minimum-average direct flip is often used for streamlines clustering, similarity search and registration (Olivetti et al., 2017;Garyfallidis et al., 2018). For tractography, each point is in a tridimensional space, but this measure could be used in higher dimensions. The "dist L p (⋅, ⋅) " between two streamlines is depicted in Fig. 1, equivalent to the sum of the norm of directed vectors ( i = i − i ). This sum of the norm is also known as the L p,1 entry-wise matrix norm.

Sum of Norm Properties
In this subsection, a few interesting properties of the sum of L 1 and L 2 are described. These characteristics are used to modify the streamline representation while keeping important distance properties. These mathematical remarks are further detailed in Appendix A. Fig. 1 Pair-wise distance between two tractography streamlines (U, W) from an ordered list of points ( m = 4 ), where Remark 1 When using Manhattan ( L 1 ) distance, comparing two lists composed of m n-dimensional points ( i , i ∈ ℝ n , i ∈ {1, ..., m} ) is equivalent to computing the L 1 distance between two m × n dimensional points ( , ∈ ℝ m×n ).

Remark 2
The L 1 distance in n-dimensions can be used as an upper and a lower bound obtained from Hölder's inequality.
From this general equation, the Euclidean ( L 2 ) distance can be bounded with L 1 with the previous inequality ( q = 1, p = 2 ). Figure 2 illustrates this inequality with bounded volume ( ‖ ‖ p ≤ r).
The sum of distances (p-norm) follow the same rules, since all summed values are positive.
i is the mean position of a streamline, then the distance between the mean position of two streamlines is always smaller or equal to the average pointwise distance.
Thus, averaging points together can be used to reduce the number of points to compare, without increasing the distance. This type of aggregation of points is used extensively for time series analysis, such as the piecewise aggregate approximation (Keogh et al., 2001), or Haar wavelet transform (Chan et al., 2003).

Streamlines Representation & Simplification
Resampling Some form of resampling is required when comparing streamlines with different numbers of points with a point-wise distance. Tractography generates streamlines with a fixed step size, resulting in individual segments of equal length. Thus, to compare streamlines from start to end, each segment can be subdivided according to the least common multiple of the number of segments. This subdivision ensure an uniform distribution of points along each streamline for the point-wise distance without changing its geometry (see Fig. 3

-a).
Downsampling Subsampling is often used to reduce streamline complexity, but in the general case, it does not offer any bounding property. Therefore, removing points before the comparison of streamlines can reduce or increase the average distance between them. This is illustrated in Fig. 3 where keeping the filled-in points will increase the distance, and keeping the hollow points will decrease it. Thus, some neighbors could be missed when doing a proximity search using subsampled streamlines, resulting in an approximate search.
Illustration of norm inequality from remark 2 (Eq. 8). In 2D, the Euclidean distance ( L 2 ), displayed in red, is bounded between 3 a Uniformly resampling streamlines can be used to make two streamlines, with unequal number of points, directly comparable using a point-wise distance; for equal length segments, this can be computed with the least common multiple of the number of segments. b Subsampling points can reduce (gray dashed lines) or increase (red dotted lines) the average point-wise distance between two streamlines. c Averaging points (green dashed lines) never increases this distance -it can only reduce it Averaging As demonstrated earlier in remark 3, averaging points together never increases the average distance ("mdist L p (⋅, ⋅)"). Therefore, when searching for all similar streamlines in a given radius (r) using averaged points, it is guaranteed that the distance remains inside that radius. Consequently, computing the barycenter, or multiple mean points ("sub-barycenters"), is an effective way to reduce the number of comparisons (i.e. to reduce dimensionality) for tractography streamline proximity search. This concept of aggregating points will be used to a generate a hierarchical comparison method, similar to a multiscale approach. Fig. 3-c illustrates this simplification by averaging together 3 points along each streamlines, resulting in 2 mean points to compare.

Proposed Hierarchical Streamline Representation
Based on previous remarks, we propose a new hierarchical approach for tractography streamlines proximity search (exact NN or KNN) within a maximum distance. In this subsection, the framework is detailed in three procedures: barycenter binning, simplification by averaging, and distance refinement. When this approach is employed to search for the nearest streamline in a template (presegmented white matter bundle atlas), these procedures are used for both: the template construction (Fig. 4), and the resulting hierarchical streamlines search (Fig. 5).
Barycenter Binning First, the barycenter of a streamline can be used as an initial proximity search. Because the distance between two barycenters is never greater than the "mdist L p (⋅, ⋅) " (see remark 3), it can be used to limit the proximity search (Fig. 5-c). Coordinates for all barycenters can be grouped together on a regular grid. When searching for all similar streamlines within a specified range (r), only the current grid and its neighbors (within distance r) need to be examined. The binning size can be optimized based on the amount of streamlines and the radius (r) of the proximity search; smaller bins will increase the preprocessing and construction time, but reduce the subsequent search time. Moreover, this barycenter binning provides independent bins, enabling efficient multithreading and the reduction of memory usage. Consequently, each bin can be separated and processed individually, only requiring the current and neighboring bins. When searching in a template, bins can be generated with an overlap to avoid to look at neighboring bins (see Fig. 4-D).
Simplification by Averaging Second, streamline points can be aggregated to create a simplified version with mean points (Fig. 5-b). This is done to reduce the number of dimensions when using a binary search tree, thereby "avoiding" the curse of dimensionality (Marimont and Shapiro, 1979;Verleysen and François, 2005;Pestov, 2013). When the number of mean points ( ) is a divisor of the initial number of ing streamlines using barycenter bins with an overlap greater or equal to the search radius, F space-partitioning tree structure for each bin using mean points points (m) the remark 3 remains true; otherwise a continuous averaging needs to be done. Afterwards, each streamline's mean points are vectorized in a × n vector, to employ both remarks 1-2. This vectorization enables the use of standard space-partitioning tree structure, which greatly increase the search speed (see Fig. 5-d). For the distance between streamlines "mdist L p (⋅, ⋅) ", with the L 2 norm per points, the searching range need to be increased by the square root of the spatial dimensionality ( √ n ). Since the distance between simplified streamlines using mean points is always smaller than or equal to the tractography streamline point-wise distance (remark 3), this aggregation enables the search for all similar streamlines within a certain radius, without missing any streamlines (Fig. 5-e). This results in a radius search with no false dismissals (no false negatives), where remaining false positives can be rectified with a refinement step.
Distance Refinement Finally, the resulting similar streamlines with their respective distances, obtained from the proximity search using simplified streamlines, can be refined by computing the complete distance (without simplification). When searching for all similar streamlines within a radius (i.e. proximity search), streamlines with a complete distance larger than the desired radius need to be filtered out (Fig. 5f). KNN search, within a maximum radius, can be done by computing refined distances and extracting the first K streamlines.

Dataset
Tractography For the evaluation, we utilized 44 subjects from the Human Connectome Project (HCP) dataset (Van Essen et al., 2013). Tractography streamlines were reconstructed using probabilistic particle filtering tractography (Girard et al., 2014) implemented in Dipy (Garyfallidis et al., 2014). Resulting streamlines were a) Streamlines b) Simplification (2 mpts) c) Barycenters binning d) Template tree search e) Mean-points distance f) Refined distance Fig. 5 Streamlines nearest neighbor search to a previously generated template (see Fig. 4): a given streamlines to cluster, b simplification by averaging resulting in mean points (mpts) per streamline, c barycenter (1 mean point) per streamline, binned using template bins (without overlap), d-e for each given streamline compute the distance to template streamlines in the same bin within the search radius using mean points employing the space partitioning tree, f recompute the complete point-wise distance for all neighbors pair, given by the previous step, and return the nearest aligned to the MNI space (ICBM 2009a, Fonov et al., 2011 using ANTs affine registration (Avants et al., 2008). This registration was computed from the T1-weighted image of each subject (already aligned with the distortion corrected diffusion space) to the MNI template.

Bundle Atlas
The bundle atlas employed for the experiment is detailed in Garyfallidis et al. (2018); Yeh et al. (2018). Streamlines from this atlas were already aligned to the MNI space (ICBM 2009a, Fonov et al., 2011, which is composed of 33 bundles (9 inter-hemispheric, 12 intra-hemispheric) with a total of 210K streamlines.
Streamlines All streamlines were defined with 32 points ( m = 32 ), to limit the variability in our testing, and make the proposed proximity search comparable to RecoBundles (Garyfallidis et al., 2018), because RecoBundles downsamples all streamlines to a fixed number of points (12). For the proposed approach, all streamlines from the atlas were compared with both ascending and descending (flip) order, resulting in 420K streamlines. This makes the employed distance "mdist L 2 (⋅, ⋅) " equivalent to RecoBundles' minimumaverage direct flip distance.

Evaluation
The proposed hierarchical streamline search was quantitatively evaluated by measuring the computation time. Each streamline search method was computed twice per subject to avoid aberrant run time, keeping the smallest time for each subject (except for longer run without binning or without mean points). This computation time is afterward averaged over all 44 subjects to compare the efficiency of the proposed method with various parameters. This computation time did not include any file loading or saving time. The proposed approach was evaluated with and without the barycenter binning at various bin_size (4mm, 8mm, 12mm). The simplification by averaging was compared at different numbers of mean points (2,4,8). Without barycenter binning and simplification, this is equivalent to a brute force search with quadratic time.
For each subject, multiple sets of streamlines were used to vary the total amount of streamlines (500K, 1M, 2M, 4M). The proximity search radius was evaluated from 2mm to 12mm, in 2mm steps. The proximity search was applied to 44 subjects using all 33 bundles from the atlas.
The proposed algorithm was also compared to RecoBundles (Garyfallidis et al., 2018) with various number of streamlines. RecoBundles was run with its default parameters from Scilpy(v1.1.0): subsampling streamlines to 12 points, a pruning distance of 8mm, and a clustering threshold of 12mm (Garyfallidis et al., 2014;Garyfallidis et al., 2016;Rheault, 2020). It should be noted that the proposed fast streamline search is not equivalent to RecoBundles; RecoBundles subsamples streamlines and relies on the QuickBundles clustering algorithm, resulting in an approximate search. Moreover, RecoBundles/QuickBundles prune clusters using an adapted clustering threshold for each bundle. The goal of this evaluation is to give an idea of the clustering speed of the proposed streamline search method, compared to a state-of-the-art similarity-based clustering method (RecoBundles). Computation times were measured from a single core on Intel's 2.4GHz Skylake 6148 processor. The proposed method employs Scipy(v1.6.3) cKDTree for space-partitioning (Virtanen et al., 2020).
Streamline simplification errors were evaluated to compare the conventional subsampling and the proposed mean points averaging. In addition, inaccuracy was estimated using false positive and false negative rates compared to an exact brute force search from 500K streamlines. For this accuracy test, RecoBundles results were averaged from 30  Figure 6 details the computation time for streamline proximity search within 8mm (mdist L 2 (⋅, ⋅) ≤ 8mm ) at various numbers of streamlines (500K, 1M, 2M, 4M), searching for similar streamlines in a bundle atlas of 210K streamlines. When using both barycenter binning and simplification, the resulting clustering speed is comparable to RecoBundles. Figure 7 presents the computation time as a function of the search radius (from 2mm to 12mm) for 4 million streamlines. When using barycenter binning, a simplification with 4 mean points (nb_ mpts=4) performs slightly better for any number of streamlines (from 500K to 4M) and search radius (from 2mm to 12mm). Using less mean points (nb_mpts=2) decrease the tree search time but increase even more the distance refinement computation time. This is reversed with more mean points (nb_mpts=8), since it reduces the distance refinement time but further increases the search time. It can be observe from Figs. 6, 7, that the optimal barycenter binning size varies in function of the number streamlines and search radius.

Quantitative Comparison
Distance errors from downsampling streamlines are presented in Table 1. Overall, the proposed mean points results in a smaller mean absolute/squared error. Table 2 depicts the number of false positives and false negatives for each method at various resampling. Since the mean points approach did not increase the distance between two streamlines (from a positive minimum difference in Table 1), it resulted in zero false negatives when using a brute force approach. Without refinement, the fast streamline search is equivalent in accuracy to an exhaustive search with simplified streamlines. It can be noted that the proposed fast streamline search with refinement and mean points simplification results in an exact search, based on the "mdist L 2 (⋅, ⋅) " measure. Thus, only computation time varies when changing the bin size or the number of mean points, resulting distances and clustered streamlines do not change. Figure 8 shows streamlines extracted using both the proposed method (radius of 4mm, 6mm or 8mm) and RecoBundles. Both clusters were obtained from the Corticospinal tract (CST) in the bundle atlas. Results for other bundles are displayed in Appendix B (Figure 9, 10 and 11). Fig. 7 Clustering time comparison using 4 million streamlines for different search radii ( L 2 ≤ r ), for all 33 bundles in the atlas. Computation times were averaged over 44 subjects from the HCP dataset Table 1 Distance errors when using subsampling or mean points at various number of points (4,8,16). Comparing the estimated distance to the exact distance (with 32 points), from 500K streamlines to the left Corticospinal tract (CST) bundle in the atlas. The average error for each approach is presented with both mean absolute error (MAE) and mean squared error (MSE). The minimum and maximum differences are obtained from the exact distance minus the distance with resampled streamlines.

Discussion
Overall, the proposed approach using 4 mean points results in the fastest computation time on average. Streamlines simplification with 4 mean points is a good trade-off between distance refinement computation time and tree search speed. In addition, the optimal barycenter binning size varies in function of the search radius, the number of streamlines and also from one subject to another. Nonetheless, not using this barycenter binning generally results in slower performance (green lines in Figs. 6 and 7) and is highly dependent of the bin size. Directly using all streamlines points (32), without simplification  Fig. 6), results in a poor computation time and heavy memory usage for the binary search tree. Bins of 8mm and 12mm without simplification are not displayed since they were significantly slower than binning at 4mm. Depicted in Tables 1-2, the proposed framework (barycenter binning, simplification by averaging and distance refinement) accurately find all similar streamlines without any false positives or false negatives. Results are visually comparable (Fig. 8) to existing approaches that use an approximate similarity search (RecoBundles), where further comparison are displayed in Appendix B.
Despite this comparison, some part of this fast streamlines search algorithm could be directly integrated inside QuickBundles and RecoBundles to further improve their clustering speed when matching bundle centroids. Additionally, mean points could be employed in other tractography approaches, instead of subsampling, to reduce simplification errors when computing the distance between streamlines.
The proposed lower and upper bound definitions could be further improved using specific properties of tractography streamlines, such as the step size and maximum curving angle. However, these values would change from one tractography algorithm to another. As mentioned previously, tractography streamlines normally have a fixed segment length, however some researchers compress streamlines with the Ramer-Douglas-Peucker algorithm (Hershberger & Snoeyink, 1992) or a similar variant for tractography (Presseau et al., 2015) to save disk space. It should be noted, that compression algorithms modify streamlines thus they will change the original distance.
Other approaches could be used to further reduce the number of points (or dimensions) required when employing a search tree (O'Donnell & Westin, 2007;Olivetti et al., 2012;Wang & Shi, 2019;Legarreta et al., 2021). Nevertheless, those dimensionality reduction techniques on tractography streamlines do not preserve distances nor guarantee any lower/upper limits on distance, resulting in an approximate neighbor search.

d) e) f)
Applications This proposed approach would be useful in a clinical setting when the search accuracy is critical, especially when missing streamlines (from false negatives) could significantly alter the analysis. This type of hierarchical search will become necessary when working with large tractograms, because computing an exhaustive search would be unfeasible. Still, before comparing streamlines, validating the tractography reconstruction and brain registration is crucial for clinical applications.

Limitation & Future Work
This proposed method and streamlines simplification were specifically designed for the sum (or average) of Minkowski p-norm. This hierarchical approach might be adaptable to other streamlines similarity measures described by Olivetti et al. (2017), however it would requires a redefined simplification algorithm along with new upper and lower bounds.

Conclusion
The proposed framework efficiently and accurately search for all similar tractography streamlines inside a given radius. This method can be used to cluster streamlines into bundles, based on an given white matter atlas. The use of simplification by averaging (mean points) combined with a space-partitioning search tree significantly reduces the query time, with no false dismissals. Furthermore, barycenter binning provides independent bins, enabling efficient multithreading and the reduction of memory usage. Finally, this proposed method guarantees accurate results and is comparable in speed to existing approaches using approximate similarity search.

Information Statement Sharing
The datasets employed in this experiment is available at Human Conne ctome. org, from the Human Connectome Project (HCP) (Van Essen et al., 2013). The bundle atlas is available at zenodo. org/ record/ 36136 88 (Garyfallidis et al., 2018;Yeh et al., 2018). Resulting streamlines generated during the current study are available from the corresponding author on reasonable request. Where both the tractography algorithm and RecoBundles segmentation were computed with Dipy at dipy. org (Garyfallidis et al., 2014). An open source implementation of the proposed Fast Streamline Search is available in Dipy.

A. Sum of Norm Properties with Detailed Equations
Proof (Remark 1) The dist L 1 (U, W) is equivalent to computing the L 1 distance between two m × n dimensional points ( , ∈ ℝ m×n ). Fig. 11 Results for the left Uncinate Fasciculus (UF): a the bundle atlas, b RecoBundles result, c RecoBundles result (in green) showing in red a few streamlines missing in RecoBundles (false negatives) but present in the proposed exact search at 4mm radius, and in purple, streamlines missing in RecoBundles but present in the proposed search at 6mm radius. The proposed proximity search, mdist L 2 (⋅, ⋅) ≤ r , using a radius of: d 4mm, e 6mm, and f 8mm a) b) c)

Proof (Remark 2)
The L 1 distance in n-dimensions can be used as an upper and a lower bound the L 2 distance, from Hölder's inequality ( ∈ ℝ n ).

Proof (Remark 3)
The distance between the mean position ( = 1 m ∑ m i=1 i ) of two streamlines is always smaller or equal to the average point-wise distance. This can be obtained from L p -norm properties (1 ≤ p < ∞ , , ∈ ℝ n , ∈ ℝ). (11) permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.