Entrack: Probabilistic Spherical Regression with Entropy Regularization for Fiber Tractography

White matter tractography, based on diffusion-weighted magnetic resonance images, is currently the only available in vivo method to gather information on the structural brain connectivity. The low resolution of diffusion MRI data suggests to employ probabilistic methods for streamline reconstruction, i.e., for fiber crossings. We propose a general probabilistic model for spherical regression based on the Fisher-von-Mises distribution, which efficiently estimates maximum entropy posteriors of local streamline directions with machine learning methods. The optimal precision of posteriors for streamlines is determined by an information-theoretic technique, the expected log-posterior agreement concept. It relies on the requirement that the posterior distributions of streamlines, inferred on retest measurements of the same subject, should yield stable results within the precision determined by the noise level of the data source.


Cerebral White Matter and Diffusion MRI
The structural connectivity between different cortical brain regions is established by white matter, that is composed of myelinated axons to distribute action potentials as messages between communicating neurons. The functional importance of connectivity for cognition has been undisputedly recognized by neuroscience research (Bargmann and Marder 2013;Filley and Fields 2016).
The advent of diffusion-weighted magnetic resonance imaging (DWI) (Chilla et al. 2015;Soares et al. 2013) has empowered neuroscientists and neurologists to monitor changes in the structural connectivity with potential relevance for diagnosis, prognosis and therapy of neurodegenerative diseases (Oishi et al. 2011). DWI is currently the only non-invasive, and non-radiative imaging modality, which enables neurologists to investigate the connective micro-architecture of the white matter in a minimally inva-  (Beaulieu 2002), making it a highly informative probe of the fibrous white matter (Bihan and Iima 2015). The axon bundles of white matter locally exhibit clear preferential directions, as shown in Fig. 1a.
However, fiber tracking algorithms are required to reconstruct consistent long-range tissue connectivity from local, voxel-centric 1 DWI measurements.

Tractography
Long-range connections in the white matter are commonly referred to as streamlines, fibers or tracts. Algorithmic methods to computationally reconstruct such streamlines from DWI are known as tractography (Jeurissen et al. 2019;Nimsky et al. 2016). Schematically, tractography infers structural connections between voxels to answer questions like "Does there exist a structural connection between regions A and B?". We show a prototypical tractography result, also referred to as tractogram, in Fig. 1b.
Tractography is clinically applied to gather health information for a number of neurological conditions, especially for preoperative planning of neurosurgery, and for research on stroke and dementia impact on brain function (Yamada Tractography poses a major challenge due to its ambiguity mostly caused by partial volume effects, since axon diameters rarely exceed few micrometers, while the DWI resolution is limited to the scale of millimeters. This lack of resolution severely complicates inference of streamlines since the superposition of diffusion information renders it difficult to disambiguate locations where fibers cross, touch, or fan apart (Jbabdi and Johansen-Berg 2011). These complex fiber configurations have been observed to be highly prevalent in the white matter of human brains (Jeurissen et al. 2013) which further impedes data analysis of white matter especially in neurology.
The majority of tractography algorithms reconstructs streamlines in a local manner, i.e. they proceed iteratively from a given seed point, and greedily determine the direction of the next step based only on the local diffusion features, and information from previous direction estimates. Tractography algorithms can be distinguished into deterministic and probabilistic methods depending on how they estimate the direction of the next step. While deterministic methods compute a point-estimate of the next direction in line with the most-likely direction, probabilistic methods infer a distribution over possible directions. Sampling from this distribution supports following multiple traces along different directions at every step. In particular, probabilistic methods are able to express the uncertainty of their predictions, which also renders them more robust in the presence of noise.

Probabilistic Regression for Tractography
Recently, algorithms based on supervised machine learning (ML) have successfully extended the toolbox of local tractography methods. Even though these ML algorithms depend on the quality of the training streamlines, it has been shown by several works that ML models trained on fibers produced by another, unsupervised algorithm (teacher) can generalize very well to new DWI data, even improving over the teacher performance Wegmayr (2018), Neher et al. (2017), Benou and Riklin-Raviv (2019).
In this work, we present a probabilistic regression approach that avoids the conceptual problems of classification-based models such as direction discretization, and the lack of a closeness notion for directions. To define a proper regression model for d-dimensional vectors on the unit-sphere s ∈ S d−1 in a probabilistic framework, we propose a learnable posterior based on the Fisher-von-Mises (FvM) distribution (Mardia and Jupp 2000). Conditioned on the feature vector x ∈ X ⊆ R p , the posterior p FvM is given by where s, μ(x) denotes the scalar product between the random output direction s and the mean direction μ(x). C κ(x) abbreviates the normalization constant of p FvM . Besides the mean direction, the scalar concentration κ(x) is also a function of the input x, which accounts for input-dependent noise, heteroscedastic noise. In the context of tractography, x represents the local diffusion features, whereas s should be understood as the latent local direction of the fiber bundle. The functions μ(x), κ(x) are represented by neural networks, and their parameters are learned by minimizing the negative log-likelihood of observed reference streamlines.
Parameter inference for such a probabilistic approach amounts to a model selection problem and has to be carefully regularized to avoid an unbounded increase of the concentration κ(x) during model training, which would effectively reduce the model to its deterministic variant. While this effect is a common problem in many applications, both in tractography (Benou and Riklin-Raviv 2019), and outside of it (Sensoy et al. 2018;Kumar and Tsvetkov 2018), solutions are typically based on heuristics with ad-hoc penalty terms.
Instead, we derive a sound regularization scheme based on the information-theoretically optimal maximum entropy principle (Jaynes 1957). The resulting Gibbs distribution controls uncertainty by a precision parameter β that allows us to adapt the posterior uncertainty to the noise in the data. Even though the presented entropy-regularized FvM model applies to general spherical regression tasks with the need for uncertainty estimation, our focus is on applications to tractography, hence we refer to it as Entrack. Other pattern analysis applications of spherical or directional regression can be found in the prediction of word embedding vectors Kumar and Tsvetkov (2018), or object pose estimation from images Prokudin et al. (2018).

The Optimal Precision
While the precision β mentioned above enables us to regularize the global FvM posterior for streamline directions in all voxels, it is a priori not clear how to determine its optimal value. In particular, we are going to argue that common cross-validation techniques are not effective, because they measure the generalization error with respect to the posterior mean direction, whereas the precision only controls the width of the posterior distribution. Indeed, the smallest generalization error is achieved by the posterior distribution with infinite precision, which yields the well-known empirical risk minimizer as an estimator. Infinite precision implies minimal entropy, which means a sub-optimal robustness of the posterior distribution in the presence of noise. We also show that even more involved evaluation schemes, such as the Tractometer , are not a viable method to determine the optimal posterior precision, because their evaluation is still based on a single measurement instance, which is insufficient to estimate the influence of data fluctuations on the resulting tractograms.
Our model selection criterion requires at least two measurements to estimate the optimal posterior width relative to the data noise. Formally, this two instance scenario is described by the information-theoretic framework of expected log-posterior agreement (PA) (Buhmann 2010), which determines the optimal value of the precision parameter by maximizing the relative overlap between the posterior distributions on repeated measurements. We discuss its implementation in the context of tractography, and perform experiments on repeated DWI scans of the same subject to estimate the optimal precision.

Extension of Previous Work
This work is an extended version of our previous conference paper Wegmayr et al. (2019). We have extended, and reorganized the theoretical contributions about probabilistic directional regression, and entropy regularization (Sect. 3), including a novel annealing algorithm (Algorithm 1). Moreover, we propose the method of posterior agreement to determine the optimal precision (Sect. 3.5), and describe how to implement it for tractography (Sect. 4.3). The experiments are extended considerably, too, by investigating case studies of posterior estimation of local fiber direction (Sect. 5.3), and its relationship with fractional anisotropy (Sect. 6.2). Additionally, the evaluation on the Tractometer benchmark has been extended to include more competing methods (Sect. 6.3). Lastly, the experimental validation of posterior agreement on retest data also represents a new contribution (Sect. 6.5).

Overview
After summarizing related work in Sect. 2, we describe a probabilistic model for spherical regression, based on the FvM distribution, in Sect. 3.2.
To address the wide-spread problem of probabilistic overfitting, we introduce a regularized Gibbs free energy objective in Sect. 3.3, which controls the entropy of the posterior distribution via a precision parameter. We discuss its implications for model training, including an automatic annealing algorithm for parameter optimization in Sect. 3.4. Concluding the general description of methods, we present the expected log-posterior agreement for the FvM posterior in Sect. 3.5, which allows us to calibrate the precision parameter according to the noise level in the data. Section 4 presents the described methods in the context of streamline tractography, which is also indicated by the term Entrack. In particular, we define the models for DWI data, and their interpretation in terms of tractograms. Based on a factorization of tractograms into independent, piecewise linear streamline segments, we use the entropy-regularized regression model in Sect. 4.1 to learn the relationship between local fiber direction, and the diffusion data.
Using a step-wise tracking algorithm, we show how to use the local Entrack posterior to obtain long-range streamlines in Sect. 4.2. The calculation of the PA for tractograms from repeated DWI measurements is described in Sect. 4.3.
After providing details about DWI data preprocessing (Sect. 5.1), and the neural network implementation of the Entrack posterior (Sect. 5.2), we perform case studies on prototypical diffusion profiles in Sect. 5.3, to investigate the patterns learned by the model.

Local Tractography Algorithms
Local tractography algorithms, as opposed to global tractography (Reisert et al. 2011), reconstruct streamlines independently, in an iterative way, based on the DWI signal in a close spatial neighborhood. This design strategy has served as the core idea of many tractography algorithms since streamline tractography originally used Runge-Kutta methods to integrate the streamline progression (see Basser et al. (2000)).
Later, the works of Behrens et al. (2003), Friman et al. (2006) introduced local, probabilistic tractography models based on mathematical models for the posterior distribution of the streamline direction. While these probabilistic methods have proven to be robust to noise, they are computationally expensive, because they need to re-estimate the high-dimensional integrals involved in the posterior for every voxel.
Very recently, a new generation of models based on supervised machine learning (ML) entered the scene, and they promise to solve deficiencies of traditional models (Poulin et al. 2019). (i) ML-methods solve the parameter estimation problem only once over a representative set of examples during their training phase. Afterwards during inference, the algorithm only requires arithmetic evaluation of the model function at each voxel, which is efficiently achieved.
(ii) Moreover, ML-methods are better suited to capture complex patterns between DWI data and fiber direction in a non-parametric way than traditional approaches, which are limited by the richness of parametric statistical models.
However, ML-methods rely on fiber tracking examples to yield supervision information which is not required for traditional ("unsupervised") methods. To circumvent this issue, supervised approaches have been trained on the output of the previous state-of-the-art algorithms in traditional tractography; furthermore, well-curated training sets are becoming available in increasing numbers (Wasserthal et al. 2018;Essen et al. 2013).
Depending on the estimation technique for local streamline directions, ML models for tractography have been formulated either as regression problems or as classification tasks. Classification models Benou and Riklin-Raviv 2019) are probabilistic in nature, but require categorical classes to approximate continuous directions. In contrast, regression models (Wegmayr 2018;Poulin et al. 2017) provide the more appropriate representation for continuous directions, but we are not aware of probabilistic regression models in the context of tractography.

Uncertainty Quantification
Uncertainty in statistical inference arises in two distinctly different flavors -epistemic and aleatoric uncertainty -as described by Kiureghian and Ditlevsen (2009) for general engineering. Kendall and Gal (2017) discusses the estimation of epistemic and aleatoric uncertainty in the context of computer vision. The first one, epistemic uncertainty, refers to our uncertainty about the model parameters, and it decreases when more observed samples become available. The second one, aleatoric uncertainty, refers to input-dependent noise, which is inherent to the data distribution. As such, it is unaffected by the number of observed samples. Very recently, predictive models, which also incorporate estimation of aleatoric noise, have received increasing attention, e.g. for categorical classification (Sensoy et al. 2018).
Probabilistic regression has been addressed by Prokudin et al. (2018), who uses a mixture of 1-dimensional FvM distributions in the context of object-pose estimation, and by Kumar and Tsvetkov (2018) for sequence-to-sequence models for language generation. Similarly to this work, the latter proposes a probabilistic error function based on ddimensional FvM distribution, however, their focus is rather on reducing model training time than on uncertainty quantification.
Lastly, we also mention the method of Hauberg et al. (2015) for shortest-path tractography, who uses probabilistic numerics to solve Gaussian-process ODEs.

Expected Log-Posterior Agreement
The framework of expected log-posterior agreement defines a model selection method for algorithms and it was originally derived from information-theoretic principles (Buhmann 2010). More precisely, as described by Buhmann (2013), it measures the trade-off between informativeness, and stability of a cost minimizing algorithm in terms of the overlap that its posterior distribution exerts between repeated measurements. An algorithm's posterior distribution is considered informative, if it narrows down the set of potential solutions for each measurement, and it is considered stable, if the posteriors obtained from repeated measurements agree with each other in spite of measurement noise.
The PA framework, sometimes also referred to as Approximation Set Coding, or Gibbs posterior agreement, has been applied in various settings such as singular value decomposition (Frank and Buhmann 2011), spectral clustering (Chehreghani et al. 2012), Gaussian process regression (Fischer et al. 2016), and combinatorial optimization problems .
The PA criterion has also been applied in the context of neuroscience, namely to determine the optimal number of clusters for cortex parcellation (Gorbach et al. 2018).

The Fisher-von-Mises Distribution
The FvM distribution is a unimodal, directional distribution defined on the d-sphere S d−1 . For random unit vectors s ∈ S d−1 , the FvM density is given by with s, μ = d i=1 s i μ i , and the normalizing constant where I n (.) denotes the modified Bessel function of the first kind. The FvM distribution is parameterized by the unitlength mean direction μ ∈ S d−1 and the scalar concentration κ ∈ R + . We illustrate the d = 3 dimensional FvM density for three different concentration parameters κ in Fig. 3. The norm of the first moment W (κ), and the entropy H (κ) of the FvM distribution are given by We illustrate both functions for d = 3 in Fig. 4, and note that in contrast to the mean direction μ, the norm of the first moment, i.e. W (κ), can be smaller than 1. Indeed it vanishes in the limit of very small concentration κ → 0 when p FvM approaches the uniform distribution proportional to the inverse surface of the d-sphere: For very large concentration, the FvM distribution contracts at the mean direction: where δ denotes the Dirac measure.

Probabilistic Regression with the FvM
In spherical regression, we want to estimate the regression function y : X → S d−1 , x → y(x), which maps the feature space X ⊆ R p to the d-sphere S d−1 . As the feature vectors x ∈ X are drawn from a distribution p(x), the observations y(x) are random variables. The estimated regression function is denoted as μ, and as the involved vectors have unit-length, the squared distance between a prediction μ(x) and the corresponding observation y(x) effectively reduces to the negative cosine loss, when disregarding constant terms: The loss in Eq. (7) is −1, if the prediction points into the same direction as the observation, and 1 if they are antiparallel. To obtain the corresponding probabilistic regression model, we additionally introduce a function κ : X → R + , which acts as the concentration parameter of a predicted FvM distribution p FvM (. | x) := p FvM (. | μ(x), κ(x)). The loss of the functions μ(x), κ(x) is the negative log-likelihood of the direction y(x) under the corresponding FvM distribution: The functions μ, κ are typically parametrized, e.g. in terms of neural networks. Given inputs x ∈ R p , the simplest such example would be which define the parametrized FvM posterior In our experiments, the neural networks are more complex than Eq. (9) , but it effectively plays the same role. Given a training set {(x i , y i )} i=1...n (∀i : y i := y(x i )), the parameters ϕ are estimated by minimizing the empirical risk function The risk function inherits the property of loss attenuation from the loss function in Eq. (8), which means that the loss caused by a large deviation − y(x), μ(x) is reduced by a low certainty κ(x). We illustrate the effect of loss attenuation for the FvM distribution (d = 3) in Fig. 5. The attenuation property ensures an increased robustness to outliers due to an adaptive sensitivity of the loss function. Furthermore, the concentration κ(x) is a function of the input and this fact enables us to assess the certainty of the predicted direction for any sample x. However, in practice, these benefits of a probabilistic formulation will be severely reduced by overfitting. When the model complexity is large, e.g. for neural networks, the posterior p FvM ϕ can perfectly minimize the training risk, in particular its concentration estimates will be biased towards large values, as we can see from the gradient of the risk with respect to the parameters ϕ κ : which tends to zero asymptotically ∀i : μ ϕ μ (x i ) → y i , and ∀i : κ ϕ κ (x i ) → ∞, recalling that lim κ→∞ W (κ) = 1. This trend is also documented in Fig. 5, where the concentration that minimizes the loss evolves towards large values for improved model fit. Eventually, the FvM distribution fitted to the training data will concentrate all its probability mass on the mean direction. As a consequence, the probabilistic model degenerates to the deterministic limit of Eq. (6), and this behavior is undesirable for down-stream information processing, where often access to typical samples is required rather than simply extracting the most-likely direction. Moreover, at large concentrations the entropy of the output distributions is also minimal, which is detrimental for its robustness to noise, as we will discuss, together with a principled solution, in the next section.

Entropy Regularization
In the previous section we argued and demonstrated that probabilistic models can treat experimental settings with noise more effectively than their deterministic counterparts. Still the essential question remains open: How robust are different parametric distributions to the fluctuations generated by a particular data source?
The Maximum Entropy Principle, well known in physics and information theory (Jaynes 1957), provides an answer to this question of model uncertainty. In contrast to maximumlikelihood estimation, which requires assumptions about the parametric form of the desired distribution, the maximumentropy approach is based on the knowledge about moments of the desired distribution. The maximum-entropy distribution obtains its robustness from the fact that it is the least informative distribution, which still fulfills the known constraints.
To put it differently, the change of the maximum-entropy distribution with respect to perturbations of the constraints is the least possible, as it avoids any overspecification, which is not supported by the data. In the context of spherical regression, the constraints are represented by the observations x and the observed direction y(x). Each observation provides a constraint in the entropy maximization over posterior distributions p(s | x) in the following sense: where E p(s|x) denotes the expectation with respect to the The constraint can also be written as E p(s|x) s, y(x) = w, which also shows that w should be interpreted as the amount of spread that p(s | x) has around the observations y(x). Using a Lagrange multiplier β ≥ 0, we can rewrite the constrained optimization problem in Eq. (14) as an unconstrained problem min p(.|x) g β y(x), p(. | x) , with The functional g β represents the Gibbs free energy, which is the difference between the expected cost −E p(s|x) s, y(x) , and the entropy −E p(s|x) log p(s | x) divided by the Lagrange factor β, controlling the precision of the direction estimation. The precision is determined by the value w in the constraint of Eq. (14), and needs to be considered as a hyper-parameter if we only observe y(x).
By variational calculus, we can derive the distribution that minimizes Eq. (15) for one particular x: which corresponds to the FvM distribution with a fixed concentration β around the mean direction y(x). While it is well known that for the constraints in Eq. (14) the maximumentropy distribution is given by the FvM (Mardia 1975), we are rather interested in learning the conditional distribution  18), crosses) as a function of κ, at different precisions β. We show two different, but fixed deviations y, μ (× and +). Colors indicates low β (red), and high β (blue). We omitted the solid lines for y, μ = 0.7 to avoid clutter. Figure  p(. | x) over all x, which minimizes the expected Gibbs free energy for an arbitrary, but fixed data distribution p(x).
Based on the observation that the FvM functional in Eq.
, which replaces the constant β with the input-dependent concentration κ(x), and the unknown function y(x) with the estimator μ(x). If we plug this into Eq.
(15), we obtain the proposed entropy-regularized loss function for the estimators μ(x), κ(x): The entropy-regularized objective in Eq. (18) has a similar loss attenuating property as the loss based on the FvM loglikelihood from Eq. (8). As shown in Fig. 6 for d = 3, the free energy is minimized at lower concentration κ(x), when the deviation y(x), μ(x) increases.
However, whereas the concentration diverges in the maximum-likelihood approach when y(x), μ(x) → 1, it remains finite with the maximum-entropy method even in this case, because the precision parameter β limits the concentration, as we will discuss in more detail in the next section.

Automatic Annealing Schedule
In analogy to Sect. 3.2, we denote the parameters of the parametrized FvM posterior as ϕ β . We propose to determine these parameters by minimizing the expected Gibbs free energy G β (Eq. (17)). In a learning setting, we substitute the data distribution p(x) in G β by the empirical distribution of the observed data x i and y i := y(x i ). The estimated parametersφ β of the FvM distribution arê To determine the optimal precision parameter β, we need to obtain the posterior parametersφ β at different values of β. While this sounds conceptually straightforward, more considerations are necessary in practice. To compare models at different precision values, we need to assert, that the optimization of the model parameters ϕ β has indeed equilibrated at the given precision value β. More formally, we consider the model parameters ϕ β in equilibrium at a given precision β, when the following condition holds: This condition can be motivated by the gradient of the risk with respect to the parameters of the concentration, i.e.
which shows that the equilibrium condition in Eq. (20) is fulfilled, if the gradient vanishes. Additionally, we see again that the concentration κ remains finite, and is limited by the precision β, in contrast to Eq. (13). In practice, it will depend on the optimization parameters (learning rate, batch size, etc.), and the precision itself, if the condition Eq. (20) is approximately fulfilled. Besides the cumbersome tuning of optimization parameters, it is very time-consuming to re-run the optimization for each precision value with a new initialization of ϕ.
Thus, we propose a robust, automatic annealing schedule to efficiently produce models in equilibrium at different precision values. The detailed annealing procedure is described by Algorithm 1, and its effect is illustrated in Fig. 7. Effectively, the progress of optimization is automatically paced by using Eq. (20) as a control criterion. As long as equilibrium is not established for a particular precision value, this value is held constant until the optimization of the model has equilibrated. This computational strategy renders model adaptation more robust to the choice of optimization parameters.
Moreover, we can efficiently extract models for different precision values during the course of a single training run, without re-initializing the parameters. Figure 8 illustrates the joint distribution of κ(x i ), and y i , μ(x i ) at one point during a typical annealing process, which shows that the model is indeed approximately at equilibrium. To see the effect of β on the optimization of the parameters of μ more clearly as well, we also consider the gradient of the loss in Eq. (18) with respect to μ. To maintain the unit-length constraint, we assume that μ = z/ z 2 . Thus, the gradient is given by where we have written z, κ, y instead of z(x), κ(x), y(x) for brevity. The gradient vanishes when z = y, and the optimum with respect to μ does not depend on the precision β.
In practice, however, when we use gradient-descent to optimize the risk function, the magnitude of the gradient will be multiplied with W κ , which clearly depends on β. So while the optimum for μ is still the same, we see that the precision
end while 10: ← ∪ ϕ 11: β ← γ · β 12: end while 13: return influences the effective learning rate for the parameters of μ. At the beginning of the annealing schedule, the factor W (κ) is small, because the precision is small (see also Fig. 7), i.e. the parameters of μ are less susceptible to the deviation from the target y. As the precision is gradually increased, the effective learning increases as well, and the gradient updates of μ will push it stronger towards y.
To recapitulate, we have discussed how the precision parameter β constrains the average concentration of the FvM posterior during training, and how we can consistently, and efficiently obtain model parameters at different levels of precision. In the next section, we address the question of how to determine the optimal precision based on the noise in the data.

Optimal Precision by Posterior Agreement
We have seen in the previous two sections, that the proposed entropy regularization effectively limits the concentration κ(x), however, it introduces the undetermined precision hyper-parameter β.
A common strategy to determine hyper-parameters would be cross-validation with respect to the generalization error − i y i , μ(x i ) on a validation set. However, cross-validation of this kind does not provide a solution here, because we have shown in the last section that the optimum of the function μ(x) is not affected by the precision (Eq. (22)).
One could object, that the generalization error does not entirely reflect the learned posterior, but only its mean direction, and that we should rather compute the expected generalization error i ρ β (x i , y i ) with where we have defined p FvM , and κ β analogously. Moreover, note again that μ does not carry the precision subscript to indicate that it does not depend on β. Even though the expected generalization error depends on the precision, it does not have an optimum for finite β, as illustrated in Fig. 9. Instead, the minimum is always achieved at β → ∞, which corresponds to the well-known empirical risk minimizer. This argument shows that we can not determine the optimal width of the posterior by minimizing risk since it introduces a bias which underestimates uncertainty.
Instead, we need a criterion, which can assess the stability of the posterior distribution with respect to data fluctuations. For this purpose, we propose a method motivated by the information-theoretic framework of expected log-posterior agreement. It is applicable to any Gibbs posterior distribution, if we have access to repeated measurements, which are used to assess the noise level in the data, and to calibrate the precision β accordingly. Specifically, we require a validation set, which provides two independent realizations In the context of spherical regression, we define the PA for one repeated measurement as where C(0) is the normalization of the uniform distribution as defined in Eq. (5). Performing the integration over all direc- tions s, the PA reads where we have written κ β = κ β (x ), etc. for brevity. The maximal agreement is realized between the following two limiting cases: (i) When the posterior distribution is very broad, i.e. it does not contain any information about the mean direction, the PA is zero: (ii) For highly peaked posteriors with different mean directions due to noisy measurements, the agreement also vanishes: Indeed, if we increase β sufficiently high, the agreement integral drops below the value C(0) achieved in the uniform case, and the posterior agreement i β (x , x ) becomes zero again. We note, that the PA also vanishes for β > 0, if either of the two measurements is uninformative, i.e. uniform by either κ β = 0 or κ β = 0.
In Fig. 10 we illustrate the behavior of i β when κ β = κ β = β, showing clearly the discussed trade-off between low precision and high precision.
To determine the optimal precision based on a validation set, we compute the average over all n repeated measurements and maximize it with respect to β: Moreover, we can assign an interesting interpretation to the numeric values of iβ . Let us consider the solid angle Ω p (β) centered on an arbitrary, but fixed μ, which contains p% of the probability mass of a general p FvM (s | μ, β) distribution. We use it to partition the unit sphere into 4π/Ω p (β) effective, conic bins. In this sense, the precision β determines a quantization angle on the sphere. If we measure the number of effective, conic bits with the binary logarithm, we find that as shown by the black line in Fig. 10. This result is interesting, because it suggests that the value of i β corresponds to the bit-rate of a noisy communication scenario, as described in the original, information-theoretic derivation by Buhmann (2010).
Besides, we can use it to define an intuitive scale for the directional precision by calibrating it with respect to β θ , where θ is the aperture angle of the solid angle Ω 99.5 (β). More precisely, the relationship between β and θ is given by Unless indicated otherwise, we scale all experimental precision values with respect to β 15 • = 114.40.

The Entrack Posterior for Tractography
In this section, we apply the entropy-regularized, probabilistic regression objective from Eq. (18) to streamline tractography on DWI measurements.
A DWI measurement I records the diffusion signal at every location r in the measurement volume V ⊂ R 3 2 , i.e. I : V × (S 2 ) N → R + , (r, g n ) → I r (g n ). Essentially, I r (g n ) corresponds to the magnitude of the local diffusion signal along one of the N experimentally fixed magnetic gradient directions g n . The diffusion signal exhibits an invariance to inversion of the gradient directions g n , i.e. I r (−g n ) = I r (g n ). In practice, it is common to work with a lower-dimensional feature representation X f of the highdimensional DWI measurement, i.e. X f : V → R p , r → ω r such that f (g n | ω r ) = I r (g n ). The function f is an experimental aspect of tractography, and we discuss its concrete implementation in Sect. 5.1, but for the following considerations we assume it as fixed, i.e. X := X f . By measuring the DWI features X of the underlying brain tissue T , the goal of a tractography algorithm A is to recover the corresponding long-range tissue connections T, also referred to as tractogram: which should be understood as a representation of tissue connectivity rather than an anatomically faithful image of individual axons.
To learn the tractography mapping A : X → T, we factorize the joint posterior p(T | X) of an entire tractogram into the product of independent streamlines t i . Moreover, we factorize each streamline into the product of its segments y i, j ∝ r i, j − r i, j−1 , but retain nearest-neighbor interactions between successive segments. Thus, the posterior probability of the direction y i, j , described by the FvM distribution, is conditioned on the diffusion data X(r i, j−1 ) at the location r i, j−1 , and the incoming direction y i, j−1 : Due to the tractography context, we refer to the posterior p trk in Eq. (32) as Entrack posterior. Under these assumptions, we can write the joint tractogram posterior as where we also have made explicit the need for priors of the fiber seed points r i,1 , and the initial directions y i,1 .

Entrack: Learning the Local Posterior
Given a measurement of DWI features X, and a corresponding reference tractogram T as supervision information for training, our goal is to learn the posterior distribution of local streamline direction p trk (y | x, y in ) based on the entropyregularized Gibbs free energy presented in Eq. (18). Note, that we have denoted x ∈ R p as the variable for the local diffusion data, and y in ∈ S 2 as the variable for the incoming fiber direction. Together, they constitute the input vector x = (x, y in ) ∈ R p × S 2 , which the Entrack posterior conditions on.
In the following, we discuss how to decompose the data set (X, T) such that it can be used with the risk function introduced in Eq. (19). Specifically, we need to construct samples (x i , y in i ), y i to capture the relationship between the target direction y and the input (x, y in ). We detail the corresponding sample generation process in Algorithm 2, and illustrate it in Fig. 11. With an accordingly generated training set X, we can use the entropy-regularized risk function from Eq. (19) to estimate the parameters of p trk β := p trk ϕ β . However, we also need to account for the inversion symmetry of DWI data, which makes it equivalent to traverse a streamline in forward, and backward direction. To ensure that the posterior can learn this symmetry from the data, i.e. p trk (−y | x, −y in ) = p trk (y | x, y in ), we incorporate this invariance explicitly in the risk function by adding forwards (u = +1), and backwards direction (u = −1): where N refers to the number of sample streamline segments.

Fig. 11
Generation of a training sample from a reference streamline. For each point along a fiber we extract the local diffusion data X(r j−1 ) (red square), and the incoming direction y j−1 (red arrow) as inputs, whereas the outgoing direction y j serves as target (green arrow) (Color figure online).

Entrack: Streamline Inference
The trained FvM posterior p trk β is employed by the iterative tracking algorithm as described by Algorithm 3.
To construct a streamline, we start from a seed point r 1 ∈ V, and obtain the local DWI features X(r 1 ). Provided with a prior direction y 1 ∈ S 2 , e.g. from a diffusion-tensor fit, we can establish the next point r 2 of the streamline by sampling a direction y 2 from p trk β y | X(r 1 ), y 1 , and setting r 2 = r 1 + αy 2 , with a step size α. This iteration repeats, until a termination criterion is met, such as a thresholds on fiber length, strength of the diffusion signal, fiber bending angle, or leaving a predefined region of interest (ROI). The corresponding streamline t i simply consists of the traversed points, i.e. t i = (r i,1 , . . . , r i,n i ).
To obtain a dense tractogram T = {t i } i=1...n , we place n seed points within a specified ROI, e.g. within a white matter mask for whole-brain tractography.

Entrack: Posterior Agreement
In Sect. 3.5 we introduced the PA for general directional regression with the FvM posterior, and now we describe how we implement it for tractography to determine the optimal β for p trk β . Given two independent DWI measurements X , X of the same subject, we denote the corresponding tractograms, obtained with Algorithm 3 in conjunction with p trk β , as T β , T β . The tractograms carry the subscript β, because they implicitly depend on the precision via the Entrack posterior

Algorithm 3 Iterative Streamline Tractography
Require: DWI features X, seed point r 1 , prior direction y 1 , posterior of local fiber direction p trk β y | x, y in , step size α. Ensure: Predicted streamline t.
1: r ← r 1 2: t ← {r 1 } 3: y in ← y 1 4: y ∼ p trk β . | X(r 1 ), y 1 5: while terminate r, y, y in , X(r), t = True do Tracking Iteration 6: r ← r + αy 7: t ← t ∪ r 8: y in ← y 9: y ∼ p trk β . | X(r), y in 10: end while 11: return t p trk β . We recall that the PA i β (x , x ) from Eq. (25) depends on two measurements x , x of the same input, and the input to the Entrack posterior consists of two components, i.e. x = X(r), y in . Given proper image registration between X and X , it is straightforward to match the repeated measurements of the DWI data by considering the same location, i.e. X (r), X (r).
To obtain y in (r) , y in (r) from the discrete streamlines of the two tractograms T β , T β we consider a small volume around the location r, and compute y in (r) , y in (r) based on the corresponding streamlines which pass through this voxel. Thus, the continuous measurement volume V is decomposed into little cuboids with voxel size a > 0, that are indexed by their discrete location inside the measurement volume, i.e. z ∈ V a = V ∩ {z · a | z ∈ Z 3 }. We refer to the volume of a voxel as v z = {r : r − z ∞ ≤ a/2}. Even though we can now compute y in (z | T) ∝ i, j I{r i, j ∈ v z }y i, j , we still need to take into account that independent streamline bundles may cross at the same voxel, and they must not be confused with each other.
Instead we need to consider each bundle b separately, and condition the local direction on the bundle, too, i.e. y in (z, b | T). More precisely, we consider a bundle b as a set of coherent streamlines, which are similar in the sense that the average pointwise distance is small for each pair of streamlines in a bundle. This way, we can partition a tractogram into a set of bundles We also refer to Garyfallidis et al. (2012) for more details about the practical grouping of tractograms into bundles. Using the partitioned tractogram B, we can compute the local streamline direction per-bundle, up to normalization to unit-length, as with y i, j = (r i, j − r i, j−1 )/ r i, j − r i, j−1 2 . Essentially, we have decomposed the tractogram T into the directions of its individual fiber bundles at each voxel, which is also known as fixel representation (Raffelt et al. 2017), referring to "a specific fiber bundle within a specific voxel". Consequently, we can think of each tuple (z, b) as the coordinates of one fixel. We define the posterior mean direction of such a fixel as where μ is the mean direction of the Entrack posterior p trk β . Lastly, when we compute the posterior concentration of a fixel, i.e κ β (z, b | X, T β ), we also need to take into account the number of streamlines which represent the summary direction y in (z, b | T β ). Intuitively, we should be more certain about the summary direction, if it is represented by many fibers, i.e. the concentration should be increased 3 . Formally, the fixel concentration is scaled by the streamline density, i.e.
where the streamline density is defined as In particular, the fixel concentration κ β (z, b | X, T β ) is zero, i.e. the posterior doesn't contain any information about the fixel direction, when we don't observe any streamline.
Putting everything together, we obtain the posterior agreement for one fixel, based on Eq. (25), as where we have defined κ β (z, b) := κ β (z, b | X , T β ), etc. for brevity. Consequently, the average PA over all fixels is given by with the set of bundles that intersect a particular voxel denoted as B z = {b ∈ B(T β ∪ T β ) : ∃t ∈ b : ∃r ∈ t : r ∈ v z }.

Experiments
We provide the entire code implementing this work at https:// github.com/vwegmayr/tractography, which includes code for managing data acquisition, data preprocessing, sample generation, model training, model inference, and evaluation.

Data and Preprocessing
In the following we summarize the most important details about the DWI data and its preprocessing, i.e. how we obtain the DWI features X.

ISMRM15 Data
The simulated DWI data, that was also used in the ISMRM15 challenge, can be obtained from http://tractometer.org/. We use the DWI data referred to as "basic data" on the challenge website. The corresponding DWI image has the shape 90 × 108 × 90 × 33, with 32 gradient directions b = 1000 s/mm 2 , plus one acquisition with b = 0 s/mm 2 . The voxel size is 2 mm. We preprocess the DWI image according to the standard preprocessing pipeline described by Glasser et al. (2013), using the MRtrix tool (https://mrtrix.org/). This procedure includes the following steps, where we indicate the corresponding MRtrix commands in parentheses:

HCP Data
The HCP diffusion data, accessible at the website https:// db.humanconnectome.org/, are already preprocessed according to the standard preprocessing pipeline by Glasser et al. (2013). The DWI image has the shape 145×174×145×108, and we extract 90 gradient directions with b = 1000 s/mm 2 , plus 18 interlaced acquisitions with b = 0 s/mm 2 . The voxel size is 1.25 mm.
4 Please refer to appendix C for more details.
We perform the same procedure to estimate the per-voxel FOD as described for the ISMRM15 data.

TractSeg Streamlines
The TractSeg dataset (Wasserthal et al. 2018) is a collection of high-quality white matter reference tracts for 105 subjects, whose diffusion data is also included in the HCP dataset. It can be downloaded at https://zenodo.org/record/ 1477956. Each reference tractogram contains ∼ 1.7 million fibers, grouped into 72 reference bundles, which amount to ∼ 70 million fiber segments in total.
For training, we use the tractogram of subject 992774, and reduce it to 20% of its size by sub-sampling the streamlines, weighted by bundle-size to ensure that small bundles are not under-represented.

Entrack Model Architecture and Training
In this section, we discuss our implementation of the Entrack posterior p trk y | μ X(r), y in , κ X(r), y in , in particular the implementation of the functions μ, κ.
While the general formulation supports a wide range of possible functions, we chose a deep neural network model due to its superior ability to extract patterns automatically (Goodfellow et al. 2015). Moreover, neural networks (NN) naturally support modular architectures, which allows us to readily formulate μ, κ in terms of two output modules NN μ , NN κ , based on a shared NN module NN z : μ X(r), y in = NN μ z X(r), y in κ X(r), y in = NN κ z X(r), y in z X(r), y in = NN z X(r), y in (41) Specifically, each NN module is a series of fully-connected layers, as shown in Fig. 12, together with the detailed parameters. The inputs X(r), y in are both flattened, and concatenated to form a 408-dimensional input vector, i.e. 3 dimensions for y in , and 405 = (3 × 3 × 3) × 15 dimensions for X(r), which represents the 15 FOD coefficients (l = 4) for each voxel in a 3 × 3 × 3 cube centered on the location z = a[r/a], where a is the voxel size.

Model Training
The described NN model for the Entrack posterior is trained on samples obtained from the TractSeg streamlines of subject 992774, using the sample generation procedure described by Algorithm 2. For parameter optimization, we use the annealing scheme described in Algorithm 1 with η = 0.99, = 0.01, β 0 = 10, β s = 1000. Moreover, we use the Adam optimizer (Kingma and Ba 2014) with a learning rate of 2 · 10 −4 , and a batch size of 512. Note, the number of training epochs depends on how fast the annealing proceeds, but in our experiments it typically reached the target precision within 30 epochs.

Local Case Studies of Entrack Posterior
To better understand which patterns the Entrack model has recognized in the training data, we perform a series of experiments on prototypical inputs.

Single Fiber Direction
In this setup, we investigate how μ X(r), y in , and κ X(r), y in behave when we rotate y in , while keeping X(r), and β fixed. For this purpose, we select DWI features X(r) from a voxel in the corpus callosum (see inset of Fig. 13a), which exhibits a clear, unidirectional DWI signal visualized by the gray FOD in Fig. 13b.
In Fig. 13a we consider the change of κ, when y in is rotated in-plane, relative to the fixed DWI input. More precisely, the figure shows the function κ X(r), R θ e x , where R θ is a 3×3 rotation matrix whose rotation axis is perpendicular to the plane of view, and e x = (1, 0, 0).
We observe, that the concentration of the Entrack posterior is largest along the DWI main direction, and decreases when perpendicular. This behavior makes sense, because we expect the uncertainty to be small, when the incoming fiber direction agrees with the local diffusion data, and to increase when they disagree.
Moreover, the angular profile is approximately inversionsymmetric, as should be expected from the properties of DWI data. In Fig. 13b we consider the probability of proceeding along y in , i.e. log p trk (y in | X(r), y in ), again with y in = R θ e x .
As expected, the probability to follow the previous direction, is the highest when it is aligned with the diffusion data.
However, it is also high, when y in is perpendicular to the main direction of diffusion. We can understand this unexpected behavior in the sense, that the model recognizes situations where the incoming direction clearly contradicts the present direction of diffusion. But instead of predicting a suboptimal superposition between the previous direction and the DWI main direction, the non-linear model favors continuity with respect to the incoming direction.
This interpretation is also supported by Fig. 13c, which shows the amount and direction of deflection of μ from y in . The two black arrows in the figure represent one exemplaric pair y in and μ(x, y in ) to illustrate the deflection. The exemplaric incoming direction has an incidence angle of about θ = 45 • , for which we read off a deflection of ca. +20 • , as shown by the radius and color of the intersecting lobe. Thus, the mean-direction predicted by the model is rotated 20 • clockwise with respect to the incoming direction.
This example shows, that the model pushes the incoming direction closer to the main direction of diffusion, if they sufficiently agree. In contrast, when the incoming direction does not relate to the diffusion data (e.g. at θ ≈ 90 • ), the model predicts no deflection, but rather follows the previous direction, effectively implementing a continuity prior. We provide a similar case study, which supports the same conclusions, but for crossing fiber directions, in appendix F.

Influence of Precision
In this experiment, we investigate how the local logprobability profile from Fig. 13b changes as a function of the precision β. For this purpose, we visualize the profile of log p trk β (y in | X(r), y in ) for different values of β in Fig. 14. As expected, the sensitivity of the posterior to the details of the data increases with the precision. More precisely, the dependence of log p trk β R θ e x | X(r), R θ e x on θ is strongly modulated by the DWI data for high precision β, and tends to be isotropic, i.e. insensitive, for small values of β. This observation illustrates nicely the concept of precision: At low precision, the output distribution is broadened, taking into account only the strongest part of the data signal. On one hand, this smoothing renders the posterior robust to data fluctuations, on the other hand, it suppresses fine details in patterns. It only starts to take into account more details, when we increase the precision. Thus, the posterior will capture higher-order patterns in the data, but it will also be more susceptible to noise. Fig. 14 Log-likelihood log p trk β (y in | X(r), y in ) for low precision β (red), and high precision (blue). The inset FOD (gray dumbells, center) illustrates the fixed input data X(r). The polar angle θ controls the orientation of y in = R θ e x (Color figure online).

Whole-Brain Tractography
In the previous section we studied local properties of the Entrack posterior p trk β (y | X(r), y in ). In this section, we focus on its performance in the context of whole-brain tractography, i.e. the iterative tracking procedure of Algorithm 3. In particular, we are interested to determine the optimal value for the precision β. For all whole-brain tractography experiments we use the posterior model p trk β trained on HCP subject 992774, as described in Sect. 5.2. Experimental Parameters We describe the concrete experimental parameters required for Algorithm 3 to produce a whole-brain tractogram. Besides the posterior p trk β , several other, influential factors are involved in the iterative prediction: -Interpolation: The original ISMRM data comes at a voxel size of 2 mm, we up-sample it to the resolution of the HCP data (1.25 mm), using trilinear interpolation. -Seeds: We place one seed point at the center of every voxel inside a white-matter mask, which was thresholded at a value of 0.1. -Prior: We use the main principal axis of a diffusiontensor fit as initial incoming direction y 1 .
To address the ambiguity about the sign of the prior direction, each streamline is propagated in both directions. -Step Size: We use a step size of 0.25 mm, i.e. 1/5 of the voxel size. -Length Constraints: Fibers are automatically terminated after 800 steps, and we only retain streamlines with a length between 30 mm and 200 mm. -Fiber Termination: Besides termination by length, we also terminate fibers when they arrive at a voxel outside of the white matter mask. -Fiber Filtering: Besides the length restriction, we do not further filter the predicted fibers, e.g. by curvature, etc.

ISMRM15 Phantom and The Tractometer
The Tractometer (TM) is an evaluation tool for tractography results (Côté et al. 2013;Maier-Hein et al. 2017), and it served also as comparison measure in the ISMRM15 tractography challenge. It is based on a simulated DWI phantom of the brain, which was generated using 25 carefully prepared fiber bundles, which mimic the complex fiber arrangement in the white matter (Poupon et al. 2010;Neher et al. 2014). A cross-sectional view of these bundles is shown in Fig. 1b. The TM defines two sets of metrics, which assess on one hand the quality of long-range connectivity, and on the other hand the bundle fidelity of predicted fibers. The first group of metrics includes valid bundles (VB), invalid bundles (IB), valid connections (VC), invalid connections (IC), and non connections (NC). The second group of metrics includes mean overlap (OL), mean overreach (OR), and mean F1 score (F1). Please refer to appendix D for more details about these evaluation scores.

Precision Dependence of TM Scores
In Fig. 15, we present the TM metrics as a function of the precision β. As a major observation, the TM scores do not seem to suggest a consistent optimal precision. The VCscore increasingly saturates to a maximum value of 0.52 (β/β 15 • = 1.58), while the maximum F1-score of 0.54 (β/β 15 • = 0.42) marginally decreases by 2.5% over the same range. The VB-score toggles between 23 and 24, and remains stable otherwise. The IB-score is the least consistent, with a local minimum of 123 at β/β 15 • = 0.66, however its total variation over the range of β is also very small ≈ 5%. This observation indicates, that besides the lack of a clear optimum with respect to the precision, the TM scores are also not very sensitive to β, except the VC-score, which increases by 20%. The behavior of the VC score rather suggests a comparison with the generalization error, described in Eq. (23), which also saturates for β → ∞, and does not have an optimum at finite precision.

Posterior Concentration and Fractional Anisotropy
A common quantitative measure of how much the diffusion signal is confined along a single direction is the fractional anisotropy FA ∈ [0, 1]: 5 It relates to the eccentricity of the diffusion ellipsoid, and it is 0 for an isotropic sphere, which is characteristic for voxels with ambiguous DWI measurements, whereas it is 1 for voxels with a diffusion ellipsoid clearly elongated in one direction. In Fig. 16, we show that the posterior concentration κ x, y in has indeed a strong correlation with FA(x) (r = 0.42), which means that there is a strong link between model certainty, and the articularity of the diffusion signal.

Comparison to the State-of-the-Art
To provide an absolute reference point for the presented TMscores, we include an overview of the scores achieved by 5 Please refer to appendix C for more details.

Fig. 15
Tractometer scores S as function of the precision β. Each score is scaled by its maximum S max (see legend for values) for better comparability. current tractography solutions based on supervised machine learning in table 1. Besides the row "ISMRM15 ∅", which represents the average over all teams of the ISMRM15 challenge (ML and non-ML), we have divided the results into two groups. The first group, marked with an asterisk, represents results where the model was trained on the synthetic DWI phantom, and these data are also used for evaluation. The work of Neher et al. (2017) refers to this setting as in silico→in silico, meaning that training fibers for these algorithms were obtained on the phantom data by another state-of-the-art algorithm. Even though the training fibers do not exactly correspond to the evaluation fibers, there exists a strong statistical dependence and we can not consider this setting as a valid generalization test. Instead, the model should be trained, and evaluated on two independent instances of (synthetic) DWI data. But as the ISMRM challenge provides only one instance, models should be trained on real DWI data, e.g. from the HCP. As expected, the results in this setting fall behind compared to the results in the in silico→in silico setting, which should be considered as (overly) optimistic estimates of the generalization performance. Aside from this criticism, it is apparent that all algorithms fail to consistently outperform the competitors in the realistic in vivo→in silico setting. Instead, we can observe a strong trade-off between OL and VC/IB. On one hand, the work of Wegmayr (2018) achieves very good VC/IB (72%/57), but poor OL (16%), on the other hand Entrack (sample) and Neher et al. (2017) achieve much better OL (58% and 59%, respectively), but poorer VC (52% and 52%, respectively). The results for the Entrack model were obtained at β/β 15 • = 1.58. A similar trade-off is seen between the (sample)/(mean) variants, which refer to how the fiber directions are obtained from the posterior during streamline progression. At each tracking step, the (sample) variant draws a random direction from the posterior, while the (mean) variant always chooses the most likely direction. The Entrack (sample) method achieves superior bundle coverage compared to the (mean) method (OL 58% vs. 45%), but at the cost of more false-positives (IB 123 vs. 116). The same is true for the FvM model, which has the same architecture as the Entrack model, but is trained without entropy regularization, i.e. using the probabilistic regression objective in Eq. (12). The bundle coverage of the FvM (sample) model is also better than its (mean) variant (OL 53% vs. 48%), but also at the cost of more false-negative bundles (IB 154 vs. 112).
Moreover, we highlight that the TM scores support a model ranking Entrack (sample) FvM (sample) Detrack Classifier (mean), which denotes the respective benefits of entropy regularization, a probabilistic loss, and a regression model. The Detrack model has the same neural network architecture as FvM and Entrack, but without the output for κ, and it is trained with the standard negative cosine loss of Eq. (7). The Classifier model also shares the same neural network architecture as all the other models, but it has a softmax output over directions, and is trained by the usual cross-entropy loss for classification. We note, that each of the listed models should be understood as a module in the complete pipeline described by Algorithm 3. Such a pipeline is controlled by various other significant influence factors (training data, seed points, etc.) that are different in each case, thereby limiting comparability. We can only assert that the results of Classifier, Detrack, FvM, and Entrack have been conditioned on the same pipeline, so that their differences can be attributed indeed to the respective choices of the objective functions.

Qualitative Results on ISMRM
In addition to the evaluation metrics provided by the Tractometer, we present qualitative tractogram visualizations. In Fig. 17 we show an overview-section of the whole-brain tractogram obtained with the Entrack model on the ISMRM data, which should be compared to the ground-truth fibers in Fig. 1b. Additionally, to facilitate a more detailed analysis, we have computed the voxel mask of the predicted corticospinal tract (CST), and visualize its overlap, overreach, and underreach with respect to the ground-truth bundle in Appendix G. Lastly, we demonstrate visualizations of the heteroscedastic uncertainty estimated by the Entrack posterior in Fig. 18. On one hand, we can visualize the spatial dependence of κ X(r), y in (Figs. 18a and b), and on the other hand, we can compute per-fiber statistics, such as the log-probability per streamline, i.e.
shown in Figs. 18c and d. As we have discussed before, the concentration parameter measures the degree of certainty that the model encodes on the fiber direction at a given location.
In Fig. 18a we can observe that the concentration/certainty is larger at the core of bundles than in the periphery, which agrees with the fact that the diffusion data is less ambiguous at the bundle cores than at the boundaries. Areas that are closer to the white matter boundary, such as bundle outlines, have lower concentration, because the diffusion signal is weaker in those areas. In particular, fiber end points exhibit the lowest concentrations, as they are located right at the white matter boundary, as shown in Fig. 18b. In addition to per-point statistics, the per-fiber statistic log p can be used to automatically detect fiber outliers, as shown in Fig. 18c and d. Clearly, without marking fibers in comparison to the average log-likelihood, it is highly error-prone to discover such outliers visually in a tangled whole-brain tractogram. The illustrated implausible loop was found in the ISMRM ground-truth fibers, which are otherwise well prepared. This finding also underlines the difficulty of preparing high-quality reference standards for tractography. Lastly, we note that in contrast to e.g. curvature based outlierdetection, the Entrack model acts as a data-informed filter, i.e. it can recognize fibers, which are strongly bending, but supported by the diffusion data, whereas these fibers would be discarded by a curvature dominated filter.

HCP Retest Data and Posterior Agreement
In this section, we show the results for the optimal precision obtained with the PA criterion from Sect. 4.3, based on two independent DWI measurements of the HCP subject 917255. In Fig. 19, we show the measured values of the expected posterior agreement I β from Eq. (40), and the Fig. 19 Average log-posterior agreement (red crosses), and generalization error (blue). The phenomenological model P A β (Eq. (43), orange, solid) agrees well with the empirical PA, and achieves its maximum value iβ = 4.14, atβ/β 15 • = 0.41. expected generalization error ρ β from Eq. (23). In contrast to the Tractometer scores, we observe a clear optimum of I β at β/β 15 • = 0.41. As anticipated by our discussion in Sect. 3.5, the generalization error ρ β suggestsβ → ∞ and thus fails to provide a finite estimate for the precision.
Furthermore, we are interested to explain the empirical PA with a phenomenological model of the form which depends only on the summary statisticsθ,n. These are the average number of fibers per fixel where W (.) was introduced in Eq. (4a), and the average deviation between the fixel direction on the two instances: This description has one free parameter λ, which essentially captures how fast the local variance of a fiber bundle increases when the precision is decreased. We refer to appendix B for more details about the origin of the parameter λ. It is a joint property of the iterative tracking together with the local posterior, and the DWI data distribution. In particular, it can be considered as a measure of how fast bundles produced by a particular tracking algorithm, on a particular DWI source, tend to disintegrate when the precision is lowered, as illustrated by Fig. 20. In our case, using λ = 10 provides a good match between the measured PA, and the phenomenological model, as shown by the orange curve in Fig. 19. Its maximum value is iβ = 4.14, which means, at the given noise level, we can contract the Entrack posterior up to a concentration, which is equivalent to the partition of the sphere into ≈ 16 = 2 iβ equally sized cones. A higher resolution can not be argued for, since it would increasingly reduce the agreement between the posteriors on the two measurements. This effect is not captured by the expected generalization error ρ β (blue curve in Fig. 19), showing again that ρ β is not an appropriate measure to determine a finite optimal precision, which is necessary to maintain the benefits of probabilistic models.

Qualitative Results on HCP Data
We provide some qualitative tractography examples obtained at the optimal precisionβ in Fig. 21

Discussion & Conclusion
We have presented a general probabilistic model for spherical regression based on the Fisher von Mises distribution, with an application to connectomics and the underlying inference of streamlines in white matter of the brain. Our theoretical considerations advocate the model to address loss attenuation, and heteroscedastic uncertainty quantification. For the proposed FvM model, we investigate the issue of probabilistic overfitting in tractography, which is commonly encountered in different probabilistic models, but only addressed by ad-hoc solutions. For instance, Kumar and Tsvetkov (2018) experiment with different regularization terms for the concentration, but it remains unclear which should be recommended in other applications. The classification model for tractography by Benou and Riklin-Raviv (2019) suggests a label smoothing heuristic to assert finite concentrations, and to establish a notion of angular closeness between direction "classes". Instead, we advocate a regularization based on the maximum entropy principle. Specifically, we derive the Gibbs free energy for the FvM distribution, and discuss its theoretical properties, in particular the role of the precision parameter β. In contrast to tuning hyper-parameters in regularization heuristics, the meaning of β is clearly motivated as the inverse width of the posterior distribution.
Based on the free energy objective, we also propose an automatically paced annealing scheme for model training, in the spirit of the deterministic annealing algorithm (Hofmann and Buhmann 1997), which is used to find superior global optima of non-convex optimization problems. Apart from the maximum entropy approach, we argue that it is inherently impossible to determine the precision parameter β with com- Fig. 21 Selected fiber tracts obtained on HCP data by Entrack atβ. Top row: TractSeg reference fibers (Wasserthal et al. 2018 mon cross-validation techniques, since they do not bias the mode of the posterior distribution, but only its width. For this reason we propose a method, which takes into account the stability of the posterior distribution with respect to repeated measurements of the data, because the agreement between normalized distributions clearly depends on their width.
In the context of our tractography experiments, we refer to the entropy-regularized posterior distribution as Entrack model. Firstly, we study its capability of uncertainty quantification with prototypical cases of DWI data, which show that the Entrack model, parametrized by a neural network, has learned non-trivial patterns of streamline progression. Secondly, we employ the Entrack posterior, which describes the distribution of local streamline direction, for iterative whole-brain tractography to reconstruct long-range tissue connectivity. On one hand, we show that it produces competitive results in the Tractometer evaluation, which is based on the synthetic ISMRM15 phantom with known groundtruth. In particular, our ablation study shows a progressive improvement of Tractometer scores with respect to the baseline classification model. It is outperformed by deterministic regression, which is moreover improved by its probabilistic formulation, and even more so by the proposed entropyregularized Entrack model. This model ranking indicates the respective benefits of regression over classification, probabilistic over deterministic, and entropy-regularized statistical inference over the maximum likelihood technique.
However, as expected, the Tractometer evaluation, based on one data instance, does not support to determine a finite optimal precision, which is essential to maintain the benefits of probabilistic models. Instead, we show that the posterior agreement, computed from two independent DWI measurements, defines a finite optimal precision, which takes into account the stability of tractograms under data fluctuations. We complete our study with qualitative examples of whole-brain tractography on both, the synthetic ISMRM15 data, and real HCP data.
In summary, the study documents a supervised approach to infer streamlines from DWI data and it validates the results by monitoring the stability of tractograms for repeated DWI measurements. Our modeling strategy generalizes to other data analysis challenges in biomedicine where a gold standard is difficult to establish and standard approaches fail to provide uncertainty calibration in accordance with data noise.

A Fixel Posterior
Consider a fixel (z, b), represented by the average direction y in z,b := y in (z, b | T) of n z,b := n(z, b | T) streamlines y j .
To represent the fixel posterior as a joint posterior over its streamlines, we write p trk joint y | X(z), y in z,b ∝ n z,b j=1 p trk y | X(z), y j .
After normalization, the joint concentration can be approximated as where we have assumed in the first step, that the concentration is similar for all fibers of one fixel, and in the second step that the local fiber means are approximately aligned in the same direction.

B Precision-Dependence ofn
To estimate the precision dependence ofn, defined in Eq.
(44), we need to estimate the precision dependence of the term n z,b j=1 μ(X(z), y j ) 2 used in the approximation Eq. (47) for the concentration of the fixel posterior.
We make the assumption that the posterior means of the fibers entering some voxel z, and belonging to the same bundle b, are effectively distributed according to the same FvM with concentration β: so that we can approximate the norm of their empirical sum with the norm of their expectation: Moreover, we introduce the free parameter λ to arrive at the phenomenological approximation n β = W (β/λ)n. (50)

C DWI Feature Representations
In this section, we provide some details about commonly used feature representations X f of DWI measurements. More details can be found in introductory texts, e.g. by Alexander (2006). The diffusion tensor (DT) is arguably the most popular feature representation (Basser et al. 1994) for DWI measurements I . It is essentially a Gaussian model of the diffusion signal: f (g n | D r ) = I 0 exp −bg T n D r g n where D r is the positive definite, symmetric 3 × 3 diffusion tensor at location r, b an experimental constant, and I 0 the unattenuated reference intensity. The DT representation compresses the DWI signal at each voxel from N directions to the three orthogonal principal directions 1 , 2 , 3 of D r , and their respective positive eigenvalues λ 1 > λ 2 > λ 3 . 6 . If we condense these features into one vector, we have that X DT : V → R 6 .
While the DT representation proves to be fairly robust, it can not properly account for complex fiber configurations, which require a multi-modal representation of directions. For this purpose, an angular expansion in terms of spherical harmonic functions Y lm is commonly used. This representation is also referred to as fiber orientation distribution (FOD): Due to the inversion symmetry of the DWI signal, the odd coefficients l = 1, 3, 5, . . . are zero. If we retain coefficients up to l = 4, we have X F O D : V → R 15 ; r → {D lm r }.

Fig. 23
Log-likelihood log p trk β (y in | X(r), y in ) for low precision β (red), and high precision (blue). The inset FOD (gray dumbells, center) illustrates the fixed input data X(r). The polar angle θ controls the orientation of y in = R θ e x .

D The Tractometer Evaluation
In the first step of the evaluation, the Tractometer tool identifies fibers which connect correct pairs of ground-truth ROIs (VC), fibers which connect incorrect pairs of ROIs (IC), and such that don't connect any pair of ROIs (NC).
Next, IC fibers which are shorter than 35 mm, are also assigned to NC. The VC, IC, and NC metrics simply report the relative size of each set. Furthermore, the sets of VC and IC fibers are clustered independently into bundles of coherent fibers. The number of IC bundles constitutes the IB metric.
In contrast, the identified VC bundles are matched to the 25 ground-truth bundles, and the VB metric reports the number of successful matches. To obtain the bundle fidelity metrics, the identified valid bundles are converted to volumet-ric binary masks, which are compared to the corresponding ground-truth bundle masks.
The OL metric reports the relative intersection between the predicted bundle mask B, and the corresponding groundtruth bundle maskB, i.e. OL = |B ∩B|/|B|. Similarly, the OR metric reports the relative bundle overreach, i.e. OR = |B \B|/|B|. Lastly, the F1 metric is simply the harmonic mean of OL, and 1-OR.

E FvM-Functions for d = 3
In reference to Eq. (4), we provide the explicit formulas for the first moment norm, and entropy of the FvM distribution in three dimensions: C(κ) = κ/(4π sinh κ) . (54a)

F Case Study: Fiber Crossing
In addition to the unimodal case study of the Entrack posterior in Fig. 13, we show a bimodal case study in Figs. 22 and 23.

G ISMRM: CST Bundle Masks
We illustrate volumetric masks of the left CST on the ISMRM phantom in Fig. 24.

Fig. 24
Bundle-mask analysis of the left CST. The ground-truth mask (yellow) is shown in the first row, together with overlays of the predicted voxel mask (purple), and its overreach (red). In the center of the second row, we illustrate the streamlines, which the predicted mask is based on. The third row contains overlays of the overreach with the overlap (green), and the predicted mask with its underreach (blue). Underreach refers to voxels covered by the ground-truth bundle, but not by the predicted bundle. The precision is β/β 15 • = 1.58.