Introduction

The analysis of data by means of statistical methods is a key aspect of scientific research. Depending on the field of research, the type of data and the size of the corresponding data sets can vary strongly, e.g., few event counts obtained in searches for rare radioactive decays, huge samples of astronomical data, or images from medical imaging. The common theme connecting these different types of applications is the statistical analysis of the data. One is typically interested in estimating the free parameters of a scientific model given a particular data set, and in comparing two or more models. Bayesian reasoning allows for this in a consistent and easy-to-interpret way. The key element is the equation by Bayes and Laplace, that is

$$\begin{aligned} p(\varvec{\theta } | {\mathcal {D}},M) = \frac{p({\mathcal {D}} | \varvec{\theta }, M) p(\varvec{\theta } | M)}{\int \mathrm {d}\varvec{\theta } \, p({\mathcal {D}} | \varvec{\theta }, M) p(\varvec{\theta } | M)}, \end{aligned}$$
(1)

where the term on the left-hand side, \(p(\varvec{\theta } | {\mathcal {D}},M)\), is the posterior probability (density) for the set of free parameters \(\varvec{\theta }\) given a data set \({\mathcal {D}}\) and assuming a model M. It is proportional to the product of the likelihood, \(p({\mathcal {D}}|\varvec{\theta }, M)\), and the prior knowledge about the parameters, \(p(\varvec{\theta }|M)\). The denominator is often referred to as the evidence Z; it is the probability to have observed the data \({\mathcal {D}}\) given the model M:

$$\begin{aligned} Z = P({\mathcal {D}}|M) = \int \mathrm {d}\varvec{\theta } \, p({\mathcal {D}} | \varvec{\theta }, M) p(\varvec{\theta } | M) . \end{aligned}$$
(2)

The evidence Z is required for model comparison.

Inference about individual parameters can be performed using the multi-dimensional posterior probability or the marginalized probabilities:

$$\begin{aligned} p(\theta _{i} | {\mathcal {D}}, M) = \int p(\varvec{\theta } | {\mathcal {D}},M) \, \prod _{j \ne i} \mathrm {d} \theta _{j} . \end{aligned}$$
(3)

We refer to commonly available textbooks for a general introduction to Bayesian inference as well as for the techniques and measures typically used, see, e.g., Refs. [1,2,3,4,5,6].

In most scientific applications, the model M results in a non-trivial form of the likelihood, such that the assumptions behind common approximations do not hold (e.g., a Gaussian shape of the likelihood or a linear relation between the predictions and the model parameters). In such cases, it is often necessary to calculate integrals of the type appearing in Eq. 3 numerically. Efficient and reliable algorithms are an important aspect of such an evaluation, in particular for models with many parameters, or, more technically, many dimensions of integration. Similar arguments hold for the optimization problem of finding the best-fit parameters associated with the global or marginal modes of the posterior probability. A variety of automated tools are available, usually tailored to the needs of a particular field of research or a class of statistical models, such as STAN [7], PYMC [8], R [9] or OpenBUGS [10]. An important criterion for choosing one tool over the others is its compatibility with the rest of the infrastructure used in a research field, such as the typical databases or the programs used for processing the results obtained.

Due to the lack of such a tool in the field of particle physics, we originally developed the Bayesian Analysis Toolkit (BAT) [11] as a C++ library under the open-source LGPL license. It features several numerical algorithms for optimization, integration and marginalization, with a strong focus on Markov Chain Monte Carlo (MCMC) algorithms. BAT has been widely used in our field of research; examples of advanced applications in particle physics are global fits of complex models [12,13,14,15,16,17,18] and kinematic fitting [19]. Over time, BAT-C++ gained traction outside of particle physics as well and has been used, for example, in cosmology [20], astrophysics [21], and nuclear physics [22]. The sampling methods implemented in BAT-C++ have also been used to develop more advanced sampling algorithms [23, 24].

Given the wide range of possible applications, we began to develop a more easily portable version of BAT that does not come with the heavy dependencies on particle-physics software stacks and that also allows for smart parallelization. This development resulted in BAT.jl [25], a completely re-designed BAT implemented in Julia [26].

Here, we describe the design, features and numerical performance of version 2.0 of BAT.jl. It is available at https://github.com/bat/BAT.jl under the MIT open-source license [27], and documented at https://bat.github.io/BAT.jl/stable/. The documentation also includes tutorials that new users can run and modify to quickly familiarize themselves with BAT.jl.

This paper is organized as follows: “Design Considerations and Software Design” describes the considerations that went into the design of the software and the code. “Numerical Algorithms” summarizes the numerical algorithms available in BAT.jl, and “Output and Visualization of Results” the options provided to output and visualize the numerical results. Tests of the numerical performance of the algorithms are reported in “Numerical Test Suite”, and an extended example demonstrating the strengths of BAT.jl is introduced in “An Extended Example”. “Summary and Outlook” provides a summary.

Design Considerations and Software Design

Design Considerations

BAT.jl aims to help solve a wide range of complex and computationally demanding problems. The design of the implementation is guided by the requirements to support multi-threaded and distributed code and to offer a choice of sampling, optimization and integration algorithms. At the same time, we want to offer a user-facing interface that makes it easy to quickly solve comparatively simple problems, while offering direct access to the lower-level functionality and tuning parameters that an expert may need to solve very hard problems. Finally, we want to make it very easy for the user to interact with and visualize the results of BAT.jl's algorithms.

We chose to implement BAT.jl in Julia due to Julia’s unique advantages for statistical and other numerical applications that require high numerical performance and easy composability of different algorithms.

Julia allows for writing code in an easy fashion, similar to Python, but at the same time enables that code to run with very high performance, comparable to code written in C, C++ or FORTRAN. In addition, Julia is one of the few languages based on multiple dispatch—this solves the expression problem [28] and, therefore, results in a level of code composability superior to object-oriented (i.e. single-dispatch) languages. This is complemented by Julia's state-of-the-art package manager, which makes it very easy for the user to install third-party packages.

Julia also enables automatic differentiation of almost arbitrary code, both via multiple dispatch [29] and via its LISP-like meta-programming capabilities [30]. This makes it possible to use gradient-based algorithms such as HMC sampling [31,32,33] and L-BFGS optimization [34] with automatic differentiation, so the user is not required to provide a hand-written gradient for likelihood and prior densities. Julia code can be run on both CPUs and GPUs [35]. The language also offers first-class support for writing multi-threaded and distributed code. These features significantly lower the effort required when tackling problems that require highly efficient code and massive computational resources.
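For illustration, the following minimal sketch (using a toy log-density that is not part of BAT.jl) shows how ForwardDiff.jl computes the gradient of plain Julia code without a hand-written derivative:

```julia
using ForwardDiff

# Toy (unnormalized) log-density of two correlated parameters; purely illustrative.
loglikelihood(x) = -0.5 * (x[1]^2 + 2 * (x[2] - x[1])^2)

# ForwardDiff differentiates the plain Julia function automatically,
# as needed, e.g., by gradient-based samplers and optimizers.
grad = ForwardDiff.gradient(loglikelihood, [1.0, 0.5])
```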

Julia also has very good support for interfacing with code written in C, FORTRAN, C++, Python and many other languages. Therefore, while BAT.jl itself is written in Julia, the user can easily access likelihood functions written in another language, typically with minimal or no impact on performance. This is important when the likelihood functions include existing complex models (e.g., physical or biological models).

BAT.jl is designed to integrate well with related packages in the Julia software ecosystem. To further improve this integration and code reuse, we have released functionalities that may also be useful outside of BAT.jl's main scope as separate packages, e.g., ArraysOfArrays.jl, ValueShapes.jl and EmpiricalDistributions.jl. As such, BAT.jl is modular, and we aim to improve this modularity in future releases.

Software Design

The software model of BAT.jl is centered on positive-definite densities. These may be normalized (and can then be viewed as probabilities) or not: likelihoods, priors and posteriors are all expressed as densities (represented by the type \({\texttt {AbstractDensity}}\)). BAT.jl automatically converts user-provided density-like objects, such as log-likelihood functions, distributions and histograms, to subtypes of \({\texttt {AbstractDensity}}\).

Julia’s unique advantages as a multi-dispatch programming language allow us to provide a very compact user-facing API that still makes it possible to build complex statistical analysis chains based on fundamental operations like sampling, optimization and integration.

To operate on densities, BAT.jl offers functions like \({\texttt {bat\_sample}}\), \({\texttt {bat\_findmode}}\) and \({\texttt {bat\_integrate}}\). These can be combined in a very flexible and intuitive fashion: \({\texttt {bat\_sample}}\) will automatically try to sample from prior densities via iid (independent and identically distributed) sampling, from posterior densities via MCMC and from existing samples themselves via resampling. \({\texttt {bat\_findmode}}\) and \({\texttt {bat\_integrate}}\) will automatically sample, use optimization algorithms or analyse existing samples, depending on the given density.
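As an illustration of these operations, the following sketch builds a simple two-parameter posterior and applies the three functions above. The toy likelihood and parameter names are purely illustrative, and the constructor names (\({\texttt {NamedTupleDist}}\), \({\texttt {LogDVal}}\), \({\texttt {PosteriorDensity}}\), \({\texttt {MCMCSampling}}\)) follow the BAT.jl v2 documentation; exact names and keyword arguments may differ between versions:

```julia
using BAT, Distributions

# Toy model: a prior over two parameters and a simple log-likelihood.
# Names and values are illustrative only.
prior = NamedTupleDist(
    λ = Uniform(0, 10),
    μ = Normal(0, 5)
)
likelihood = params -> LogDVal(-0.5 * (params.λ - 3)^2 - 0.5 * params.μ^2)
posterior = PosteriorDensity(likelihood, prior)

# Fundamental operations: sampling, mode finding and integration.
samples  = bat_sample(posterior, MCMCSampling(mcalg = MetropolisHastings(),
                                              nchains = 4, nsteps = 10^5)).result
mode_est = bat_findmode(posterior).result
evidence = bat_integrate(samples).result
```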

BAT.jl has a unified mechanism to manage default behavior and algorithmic choices. The function \({\texttt {bat\_default}}\) lets the user query which algorithm with which settings would be used for a given task. BAT.jl also records the choice of algorithms and their configuration (whether explicit or implicit) in its results. In general, BAT.jl will always try to choose an appropriate default strategy for a given task, but will let the user override default choices for algorithms and configuration or tuning parameters.

To take advantage of the parallel architecture of modern computer systems, BAT.jl uses Julia's advanced multithreading scheduler to parallelize operations automatically where possible. For example, MCMC chains automatically run on separate threads, while the user can still use multi-threading within the implementation of the likelihood function to make full use of the processors of the system without over-subscription. MCMC sampling and integration can also be run on multiple remote hosts, using Julia's support for compute clusters. MPI message transport can be used when available, but a plain TCP/IP network is sufficient.

We take great care to ensure that results are reproducible, independent of the possibly multi-threaded and distributed computation strategy. BAT uses a hierarchical scheme to partition and distribute counter-based random number generators (RNGs). By default, BAT uses the Philox RNG [36] to generate random numbers. We automatically partition this counter space (using a safe upper limit for the possible amount of random number generation in each separate computation). Each MCMC chain, and even each step of each MCMC chain, effectively uses its own independent RNG—no matter which resources that step is scheduled to be computed on. If computations are hierarchical, each partition of an RNG counter space can be partitioned again and again, following the graph of the computation. The counter space of generators like Philox typically consists of two or four 64-bit numbers. Therefore, even nested parallel computations, each with an ample reserve of random numbers, will not run out of counter space.

Numerical Algorithms

Several algorithms for marginalization, integration and optimization are implemented in BAT.jl, giving it a toolbox character that also allows for the future inclusion of further methods, algorithms and software packages. The central algorithms available in BAT.jl are summarized in the following. We do not go into detail on additional minor functionalities, like simple evaluation of the probability distribution on a grid for a small number of dimensions and the usage of quasirandom sequences.

Sampling Algorithms

BAT.jl currently provides a choice of two main MCMC sampling algorithms to the user, Metropolis–Hastings (MH) and Hamiltonian Monte Carlo (HMC). Different algorithms are more or less suited for different target densities—for example, HMC sampling cannot be used if the target is not differentiable.

Metropolis-Hastings

The Metropolis–Hastings algorithm [37] is the original MCMC algorithm for producing a set of random numbers \(\theta\) or vectors \(\varvec{\theta }\) that have the properties of a Markov chain and that converge towards a target distribution \(\pi (\varvec{\theta })\). In Bayesian analysis, this limiting distribution is the posterior probability density \(p(\varvec{\theta } | {\mathcal {D}}, M)\). The samples are generated as follows: starting from a state \(\varvec{\theta }_{i}\) at iteration i, a new state \(\varvec{\theta '}\) is proposed according to an (often symmetric) proposal distribution \(g(\varvec{\theta '} | \varvec{\theta })\). The proposal is accepted with a probability:

$$\begin{aligned} P_{\mathrm{accept}} = \mathrm{min}\left( 1, \frac{\pi (\varvec{\theta '})}{\pi (\varvec{\theta }_{i})}\ \frac{g(\varvec{\theta }_{i} | \varvec{\theta '})}{g(\varvec{\theta '} | \varvec{\theta }_{i})}\right) , \end{aligned}$$
(4)

resulting in \(\varvec{\theta }_{i+1} = \varvec{\theta '}\), or \(\varvec{\theta }_{i+1} = \varvec{\theta }_{i}\) if the proposal is rejected. We run several Markov chains in parallel and repeatedly test for convergence during a burn-in phase (see “MCMC Burn-In Process”).
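For illustration, the following self-contained sketch implements Eq. (4) for a symmetric Gaussian proposal (so that the proposal densities cancel); it is a generic textbook implementation, not BAT.jl's internal sampler:

```julia
using LinearAlgebra, Random

# One Metropolis–Hastings step with a symmetric Gaussian random-walk proposal.
# Log-densities are used for numerical stability; since g is symmetric,
# the proposal ratio in Eq. (4) cancels.
function mh_step(rng, logtarget, θ, scale)
    θ′ = θ .+ scale .* randn(rng, length(θ))
    log_accept = logtarget(θ′) - logtarget(θ)
    return log(rand(rng)) < log_accept ? θ′ : θ
end

# Run a short chain on a toy target (a standard 2D normal, up to a constant).
function run_chain(rng, logtarget, θ0, nsteps)
    chain = [copy(θ0)]
    θ = θ0
    for _ in 1:nsteps
        θ = mh_step(rng, logtarget, θ, 0.5)
        push!(chain, θ)
    end
    return chain
end

logtarget(θ) = -0.5 * dot(θ, θ)
chain = run_chain(Random.default_rng(), logtarget, zeros(2), 10_000)
```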

By default, BAT.jl uses a multivariate Student’s t distribution as the proposal distribution. The scale and correlation of the proposal are adapted automatically to efficiently generate samples from essentially any smooth, unimodal distribution. Another important characteristic of Markov chains is the acceptance rate \(\alpha\), the ratio of accepted proposal points to the total number of samples in the chain. For any given target and proposal distribution, there is an optimal \(\alpha\) that will allow the best exploration and performance of the chain.

To achieve a desired acceptance rate, the proposal distribution is tuned to adapt it to the target. After each tuning cycle (see “MCMC Burn-In Process”), the covariance matrix of the proposal function, \(\varvec{{\varSigma }}\), is updated based on the sample covariance of the last iterations and is then multiplied by a scale factor c that governs the range of the proposal. c is tuned to force the acceptance rate into the region \(\alpha _{\min } \le \alpha \le \alpha _{\max }\) and is restricted to the range \(c_{\min } \le c \le c_{\max }\). The adjustment of the scale factor is described in Algorithm 1 of [38]. The default values in BAT.jl for the acceptance rate and scale factor ranges are \(\alpha _{\min } = 0.15\), \(\alpha _{\max } = 0.35\) [39] and \(c_{\min } = 10^{-4}\), \(c_{\max } = 100\), respectively.

Hamiltonian Monte Carlo

One of the most sophisticated MCMC sampling methods is Hamiltonian Monte Carlo (HMC) [31,32,33]. Using a proposal function that is adjusted to the shape of the target distribution, HMC algorithms can yield higher acceptance rates and less correlated samples than other sampling algorithms based on random walks, thus reducing the number of samples required to fully explore the target distribution.

In HMC, the D-dimensional parameter space is expanded to 2D dimensions by introducing so-called momenta \(\mathbf {p}\) as auxiliary variables, moving from the original parameter space to the canonical phase space, \(\mathbf {q} \rightarrow (\mathbf {q}, \mathbf {p})\). To conform to standard notation when discussing HMC, we here use \(\mathbf {q}\) to represent the parameters of the model in place of \(\varvec{\theta }\).

In the HMC formalism, the target distribution \(\pi (\mathbf {q})\) is lifted to the canonical phase space using a joint probability distribution:

$$\begin{aligned} \pi (\mathbf {q}, \mathbf {p}) = \pi (\mathbf {p}|\mathbf {q}) \pi (\mathbf {q}) = \mathrm {e}^{-H(\mathbf {q},\mathbf {p})}\,, \end{aligned}$$
(5)

where the probability distribution of the momenta, \(\pi (\mathbf {p}|\mathbf {q})\), may be chosen conditional on \(\mathbf {q}\). The last equality in Eq. (5) comes from defining the so-called Hamiltonian as

$$\begin{aligned} H(\mathbf {q},\mathbf {p}) = -\log \pi (\mathbf {q}, \mathbf {p}) = -\log \pi (\mathbf {p}|\mathbf {q}) -\log \pi (\mathbf {q})\,. \end{aligned}$$
(6)

The differential equations

$$\begin{aligned} \frac{dq_i}{dt} = \frac{\partial H}{\partial p_i}, \quad \frac{dp_i}{dt} = -\frac{\partial H}{\partial q_i}\,, \end{aligned}$$
(7)

are well known from classical mechanics and referred to as Hamilton's equations of motion. Solving the equations of motion for a certain time T allows moving along trajectories \(\phi\) and gives a transition in the canonical phase space

$$\begin{aligned} (\mathbf {q}, \mathbf {p}) \rightarrow \phi _T(\mathbf {q}, \mathbf {p}) = (\mathbf {q}^*, \mathbf {p}^*)\,, \end{aligned}$$
(8)

resulting in the new point \((\mathbf {q}^*, \mathbf {p}^*)\). By marginalizing over the momenta \(\mathbf {p}\), we obtain a new proposal point \(\mathbf {q}^*\) in the original parameter space. This proposal is then either accepted as a new sampling point or rejected by calculating an acceptance ratio, similar to the MH algorithm. Since the proposal points are generated using information about the target distribution, their acceptance rates are higher than those of samples generated with non-problem-specific proposal distributions.
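In practice, Eq. (7) is solved numerically; a common choice is the leapfrog scheme sketched below for a unit mass matrix, i.e., \(H(\mathbf {q},\mathbf {p}) = -\log \pi (\mathbf {q}) + \mathbf {p}^T\mathbf {p}/2\). This is a generic illustration, not the implementation used by BAT.jl (see below):

```julia
# One leapfrog trajectory of L steps with step size ε for a unit mass matrix.
# ∇logπ is the gradient of the log target density. Generic illustration only.
function leapfrog(∇logπ, q, p, ε, L)
    p = p .+ (ε / 2) .* ∇logπ(q)     # initial half-step in momentum
    for _ in 1:(L - 1)
        q = q .+ ε .* p              # full step in position
        p = p .+ ε .* ∇logπ(q)       # full step in momentum
    end
    q = q .+ ε .* p                  # final full step in position
    p = p .+ (ε / 2) .* ∇logπ(q)     # final half-step in momentum
    return q, p
end
```

The resulting point \((\mathbf {q}^*, \mathbf {p}^*)\) is then accepted with probability \(\min \left( 1, \mathrm {e}^{H(\mathbf {q},\mathbf {p}) - H(\mathbf {q}^*,\mathbf {p}^*)}\right)\).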

Since HMC requires gradient information and introduces multiple hyperparameters (such as momenta and integration times) into the sampling process, performing Bayesian analyses with HMC samplers is usually not as straightforward as using the MH algorithm: it requires additional computational steps such as the numerical integration of the equations of motion and the selection and tuning of the hyperparameters. BAT.jl uses the AdvancedHMC.jl package [40] for the single HMC sampling steps. AdvancedHMC.jl provides several flavours of HMC, including multiple versions of the No-U-Turn Sampler (NUTS) [41]. Higher-level operations and the burn-in process are handled by BAT.jl itself, as for MH sampling. Due to the efficient support of automatic differentiation in Julia, e.g., through the package ForwardDiff.jl [29], the gradient of the target, required for HMC, can often be derived automatically. This makes it quite easy to use HMC within BAT.

MCMC Burn-In Process

Different MCMC sampling algorithms have different tuning parameters, e.g., the scale and shape of the proposal function for MH. However, a common requirement for the generation of samples that faithfully follow the target density is a suitable burn-in process: starting with an initial sample, each MCMC chain must be allowed to run until it has converged to its stationary distribution. Several MCMC chains must be compared to ensure that they share the same stationary distribution and are not, for example, limited to different modes of the posterior.

BAT.jl will by default use four MCMC chains, which are iterated in parallel on multiple threads (and in the future, also on multiple compute nodes). We initialize each MCMC chain with a random sample drawn from the prior, and we require that efficient sampling is possible for all priors. Typically, priors will be composed from common distributions provided by the Julia package Distributions.jl, which supports iid sampling for all of its distributions.

Once the MCMC chains are initialized, burn-in, MCMC tuning and convergence testing are performed in cycles. The user specifies the desired number of samples after burn-in; by default, the length of each tuning/burn-in cycle is 10% of the desired number of final samples. During each cycle, each MCMC chain is iterated and tuning parameters are adjusted in an algorithm-specific fashion. At the end of each cycle, we check for convergence of all MCMC chains. Tuning and burn-in are complete when all chains are tuned (according to algorithm-specific criteria) and have converged (see below). MCMC samples produced up to this point are discarded by default; the chains are then run for the desired number of steps (the user can also set limits such as a maximum wall-clock time) without further modification of the tuning parameters. If tuning and convergence are not successful within a (user-adjustable) maximum number of cycles, the user can choose between receiving a warning message and having the sampling terminate with an error exception.

Convergence Tests

To determine if the Markov chains have converged and the burn-in phase can stop, we adopt the Gelman–Rubin convergence test [42] and the Brooks–Gelman test [42] (our default).

We first consider a single parameter \(\theta\) and M chains run in parallel, where each chain produces N samples: \(\theta _{i1}, \ldots , \theta _{iN}\) (with \(i = 1,\ldots ,M\) labeling the chains). The Gelman–Rubin test relies on two estimators of the variance of \(\theta\): the within-chain variance estimate:

$$\begin{aligned} W = \sum _{i = 1}^{M} \sum _{j = 1}^{N} \frac{(\theta _{ij} - \bar{\theta _i})^2}{M(N-1)}, \end{aligned}$$
(9)

and the pooled variance estimate

$$\begin{aligned} {\hat{V}} = \frac{(N-1)W}{N} + \sum _{i=1}^{M} \frac{(\bar{\theta _i} - {\bar{\theta }})^2}{M-1} \, , \end{aligned}$$
(10)

where \(\bar{\theta _i}\) is the ith chain mean and \({\bar{\theta }}\) is the overall mean. Using these estimators, we construct the potential scale reduction factor (PSRF) denoted by \({\hat{R}}\):

$$\begin{aligned} {\hat{R}} = \frac{{\hat{V}}}{W} \, . \end{aligned}$$
(11)

Since the M chains are randomly initialized from an over-dispersed initial distribution, within a finite number of samples per chain, \({\hat{V}}\) overestimates the target variance while W underestimates it. This implies that \({\hat{R}}\) will have a value larger than 1, and the degree of convergence of the chains is measured by the closeness of \({\hat{R}}\) to the value 1.
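For a single parameter, Eqs. (9)–(11) reduce to a few lines of code; the following sketch is an illustration of the test, not BAT.jl's internal implementation:

```julia
using Statistics

# Potential scale reduction factor R̂ (Eqs. 9–11) for a single parameter.
# `chains` is a vector of M vectors, each holding the N samples of one chain.
function psrf(chains)
    N    = length(first(chains))
    W    = mean(var.(chains))                      # within-chain variance, Eq. (9)
    Vhat = (N - 1) / N * W + var(mean.(chains))    # pooled variance estimate, Eq. (10)
    return Vhat / W                                # R̂, Eq. (11)
end
```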

To assess the convergence of all parameters simultaneously, we also construct the multivariate PSRF (MPSRF) denoted by \(\hat{R_p}\),

$$\begin{aligned} \hat{R_p} = \frac{N-1}{N} + \left( \frac{M+1}{M} \right) {\varLambda }_1 \,, \end{aligned}$$
(12)

where the variance estimates are

$$\begin{aligned} W^*= & {} \sum _{i = 1}^{M} \sum _{j = 1}^{N} \frac{(\varvec{\theta }_{ij} - \varvec{{\bar{\theta }}}_i)(\varvec{\theta }_{ij} - \varvec{{\bar{\theta }}}_i)^T}{M(N-1)} \,, \end{aligned}$$
(13)
$$\begin{aligned} \frac{B^*}{N}= & {} \sum _{i=1}^{M} \frac{(\varvec{{\bar{\theta }}}_i - \varvec{{\bar{\theta }}})(\varvec{{\bar{\theta }}}_i - \varvec{{\bar{\theta }}})^T}{M-1} \,, \end{aligned}$$
(14)
$$\begin{aligned} {\hat{V}}^*= & {} \frac{(N-1)W^*}{N} + \frac{B^*}{N}, \end{aligned}$$
(15)

and \({\varLambda }_1\) is the largest eigenvalue of the matrix \(\frac{{W^*}^{-1}B^*}{N}\). The default cut-off we use to declare convergence in the burn-in phase is \({\hat{R}}, {\hat{R}}_p \le 1.1\).

Effective Sample Size

A drawback of MCMC is that the samples we obtain are correlated. BAT.jl provides an effective sample size (ESS) estimator to calculate the number of iid samples that would be equivalent to the N given MCMC samples with respect to the variance of sample-mean estimates. The ESS is also a valuable indicator of whether a sufficient number of MCMC samples has been produced.

The effective sample size is estimated as

$$\begin{aligned} \mathrm {ESS} = \frac{N}{{\hat{\tau }}} \,, \end{aligned}$$
(16)

where \({\hat{\tau }}\) is the integrated autocorrelation time. \({\hat{\tau }}\) is estimated from the normalized autocorrelation function \({\hat{\rho }}(\tau )\):

$$\begin{aligned} {\hat{\tau }}_k= & {} 1 + 2 \sum _{\tau = 1}^{\infty } {\hat{\rho }}_k(\tau ) \,, \end{aligned}$$
(17)
$$\begin{aligned} {\hat{\rho }}_k(\tau )= & {} \frac{{\hat{c}}_k(\tau )}{{\hat{c}}_k(0)} \,, \end{aligned}$$
(18)
$$\begin{aligned} {\hat{c}}_k(\tau )= & {} \frac{1}{N - \tau } \sum _{n = 1}^{N - \tau } \left( \theta _{k,n} - {\hat{\theta }}_k \right) \left( \theta _{k,n+\tau } - {\hat{\theta }}_k \right) \,, \end{aligned}$$
(19)

where k refers to the dimension index of the multivariate sample \(\varvec{\theta }_i = \{ \theta _{1, i}, \ldots ,\theta _{D, i} \}\) and \({\hat{\theta }}_k\) is the corresponding sample mean. Here, all samples for a given parameter k are used, independently of whether multiple chains have been run; the first index now refers to the parameter under discussion (as opposed to the chain number above). These quantities allow us to calculate an effective sample size for each dimension, \(\mathrm {ESS}_k = \frac{N}{{\hat{\tau }}_k}\).

When evaluating Eq. 17, we cannot, in practice, sum over all lags \(\tau\): while \({\hat{c}}_k(\tau )\) theoretically decays to zero for large lags \(\tau\), in practice it exhibits a noisy behavior that makes the sum over \({\hat{c}}_k(\tau )\) unstable. Therefore, we need to truncate the sum using a heuristic cut-off. The default cut-off in BAT.jl is Geyer's initial monotone sequence estimator [43]; optionally, Sokal's method [44] can be chosen.
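The following simplified sketch illustrates Eqs. (16)–(19) for a single parameter; for brevity it truncates the sum at the first non-positive autocorrelation, which is a cruder cut-off than the Geyer and Sokal criteria used by BAT.jl:

```julia
using Statistics

# Simplified effective sample size for one parameter (Eqs. 16–19).
function ess(θ::AbstractVector{<:Real})
    N  = length(θ)
    μ  = mean(θ)
    c0 = sum(abs2, θ .- μ) / N              # ĉ(0)
    τ_int = 1.0
    for lag in 1:(N - 1)
        c = sum((θ[1:N-lag] .- μ) .* (θ[1+lag:N] .- μ)) / (N - lag)   # ĉ(lag)
        ρ = c / c0
        ρ <= 0 && break                     # crude truncation of the sum in Eq. (17)
        τ_int += 2ρ
    end
    return N / τ_int
end
```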

Algorithms for Point Estimates

The global mode of a posterior distribution is often a quantity of interest. While the MCMC sample with the largest value of the target density may come close to the true mode, it is sometimes not as close as required. It is, however, an ideal starting point for a local optimization algorithm that can then further refine the mode estimate. BAT.jl offers automatic mode-estimation refinement using the Nelder–Mead [45] and L-BFGS [46] optimization algorithms, by building on the Optim.jl [34] package. When using L-BFGS, a gradient of the posterior distribution is required; again, we utilize the Julia automatic-differentiation package ecosystem to compute that gradient automatically.

Another quantity that is often computed from samples is a marginal mode. To construct marginals, a binning of the samples is performed. The number of bins can be determined using the square-root choice, Sturges' formula, the Rice rule, Scott's normal reference rule, or the Freedman–Diaconis rule; the latter is the default.
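As an example of these rules, the Freedman–Diaconis choice sets the bin width to \(2\,\mathrm {IQR}(x)\,n^{-1/3}\); a minimal sketch (not BAT.jl's internal code) is:

```julia
using Statistics

# Freedman–Diaconis rule: bin width h = 2 · IQR(x) · n^(-1/3);
# the number of bins follows from the sample range divided by h.
function freedman_diaconis_bins(x::AbstractVector{<:Real})
    n   = length(x)
    iqr = quantile(x, 0.75) - quantile(x, 0.25)
    h   = 2 * iqr * n^(-1/3)
    return ceil(Int, (maximum(x) - minimum(x)) / h)
end
```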

BAT.jl also provides functionality to estimate other quantities such as the median, the mean, quantiles and standard deviations, and to propagate errors on a fit function.

Integration Algorithms

Evidence Estimation using AHMI

In many applications, it is desirable or even necessary to compute the evidence or marginal likelihood Z (see Eq. 2). An example for the use of Z is the calculation of a Bayes factor for the comparison of two models \(M_A\) and \(M_B\):

$$\mathrm{BF} \equiv \frac{p({\mathcal {D}}| M_A)}{p({\mathcal {D}}| M_{B})} = \frac{Z_A}{Z_{B}} \; .$$

BAT.jl includes the Adaptive Harmonic Mean Integration (AHMI) algorithm [47] to compute Z given the samples \(\{ \varvec{\theta } \}\).

AHMI can integrate samples from any sampling algorithm, as long as the samples come in the form of a \({\texttt {BAT.DensitySampleVector}}\). Its use of hyper-rectangles, however, limits the applicability to a moderate number of dimensions (\(\approx 20\) in the case of a multivariate normal distribution).
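A hedged sketch of how existing samples might be integrated and two evidences combined into a Bayes factor is shown below; the algorithm type name \({\texttt {AHMIntegration}}\) follows the BAT.jl v2 documentation and may differ in other versions, and \({\texttt {samples\_A}}\)/\({\texttt {samples\_B}}\) are assumed to come from previous \({\texttt {bat\_sample}}\) calls for the two models:

```julia
using BAT

# Assumption: samples_A and samples_B are posterior samples of models A and B,
# obtained from earlier bat_sample calls (not shown here).
Z_A = bat_integrate(samples_A, AHMIntegration()).result
Z_B = bat_integrate(samples_B, AHMIntegration()).result

BF = Z_A / Z_B   # Bayes factor for model A over model B (equal prior odds)
```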

Evidence Calculation Using an Interface to CUBA

In addition to integration via AHMI, BAT offers evidence calculation using the Cuba [48] integration library. Cuba implements multiple integration algorithms that cover a range of (Monte Carlo and deterministic) importance sampling, stratified sampling and adaptive subdivision integration strategies. These will typically not scale to high-dimensional spaces, but can provide quick and robust results for low-dimensional problems.

Parameter Space Transformations

Different algorithms have different requirements on the structure and domain of the densities they operate on. HMC, for example (like other gradient-based algorithms), requires a differentiable target density and does not perform well if the parameter space has hard boundaries. CUBA, on the other hand, can only operate on the unit hypercube, and so requires a bounded parameter space. It is therefore often necessary to perform a change of variables, transforming the original density into one more suitable for the chosen algorithm. The prior distribution will typically contain sufficient information on the structure and domain of the posterior distribution to choose a suitable transformation.

BAT.jl will, by default, automatically try to internally transform posterior densities, so that the prior becomes equivalent to a standard multivariate-normal or multivariate-uniform distribution in the transformed space, depending on the requirements of the algorithm chosen to operate on the posterior. Prior distributions are often Product distributions and these are simply transformed elementwise. For univariate distributions, BAT.jl will transform according to their (inverse) CDF, multivariate normal distributions are transformed according to their covariance matrix, and hierarchical distributions are transformed iteratively. The mechanism can be extended by specialized transformations for complex or custom prior distributions. Automatic transformations for additional relevant multivariate prior distributions, e.g. Dirichlet distributions, will be added in future versions.
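The univariate case of such a prior-based change of variables can be illustrated with Distributions.jl (a generic sketch, not BAT.jl's internal transformation code):

```julia
using Distributions

# A positive parameter x with prior `d` is mapped to the unit interval via the
# CDF (suitable, e.g., for CUBA), and to an unbounded standard-normal space via
# the normal quantile function (suitable, e.g., for HMC).
d = LogNormal(0.0, 0.5)
x = rand(d)

u = cdf(d, x)                 # uniform in [0, 1]
z = quantile(Normal(), u)     # standard normal

# The inverse transform recovers the original parameter value:
x_back = quantile(d, cdf(Normal(), z))
```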

It is of course also possible for the user to transform target densities manually.

Output and Visualization of Results

The results of running the numerical algorithms in BAT.jl are presented in text and also in graphical form. In addition, user-defined interfaces can be written to bring the results into any other format.

Graphical Summary of the Results

Since a key element of all statistical analyses is the graphical representation of the outcomes, BAT.jl includes functionality to create visualizations of the analysis results in a user-friendly way. By providing a collection of plot recipes to be used with the Plots.jl package, several plotting styles for 1D and 2D representations of (marginalized) distributions of samples and priors are available through simple commands. Properties of the distributions, such as highest-density regions or point estimates like mean and mode values, can be automatically highlighted in the plots. Further recipes to visualize the results of common applications, such as function fitting, are provided. While the plot recipes provide convenient default options, the details of the plotting styles can be quickly modified and customized. Since all information about the posterior samples and the priors is available to the user, completely custom visualizations are of course also possible. Examples of plots created with the included plot recipes are shown in “An Extended Example”.

Written Summary of the Results

BAT.jl can display a written summary containing information about the sampling process and the results of the parameter estimation. The summary includes a list of all parameters with their mean, standard deviation, global and marginal mode. If the number of parameters is not too high, the written summary also includes a table with the covariance of all parameters. In addition, the effective number of samples is reported.

All of these summary results are also accessible in a programmatic fashion using BAT.jl functions, so the user can utilize them in further automated analysis.

File I/O

It is important that users have means to easily preserve the results of the MCMC sampling process, which often is computationally expensive. We provide several ways to do this:

BAT.jl is compatible with the Julia package JLD2. JLD2 provides a mechanism to store almost arbitrary Julia data structures to disk in an HDF5-based format. However, since almost everything is preserved, including the density functions themselves, JLD2 output may not be readable with other software versions later on. The combination of BAT.jl and JLD2 therefore allows for complete persistence, but is intended for short-term storage.

To allow for long-term data preservation, BAT.jl also provides a function to store posterior sample variates, weights and log-density values to HDF5 files in a more basic fashion, as plain HDF5 data sets (i.e. flat arrays). Any version of BAT.jl will always be able to read sampling output stored this way by older versions. The output is also easy to read using any programming language with basic HDF5 support.

Furthermore, samples can also be easily written to ASCII/CSV files using standard Julia packages, since BAT's sampling output is compatible with standard Julia table formats.
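A hedged sketch of both storage paths is given below; \({\texttt {samples}}\) is assumed to be the result of a previous \({\texttt {bat\_sample}}\) call, and the names \({\texttt {bat\_write}}\)/\({\texttt {bat\_read}}\) follow the BAT.jl documentation and may differ between versions:

```julia
using BAT, JLD2

# Assumption: `samples` comes from an earlier bat_sample call (not shown).

# Short-term storage: JLD2 serializes (almost) the complete Julia object.
@save "result.jld2" samples

# Long-term storage: plain HDF5 data sets via BAT's own writer/reader.
# Function names are taken from the BAT.jl documentation; check the version in use.
bat_write("samples.h5", samples)
restored = bat_read("samples.h5")
```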

Numerical Test Suite

A test suite to evaluate the numerical performance of the sampling algorithms is included in BAT.jl, and must be passed before each release of a new version. Samples are MCMC-generated from, and then compared to, a set of test distributions. A list of these distributions is given in Table 1. We compare the mean values, variances, and the global modes of the samples with those of the test distributions. We also calculate the p values of Kolmogorov–Smirnov (KS) tests for each parameter, by comparing the marginal distributions from the sampling algorithm with marginal distributions from samples generated by iid sampling. Small p values lead to further investigations to ensure that the sampling algorithm is functioning properly.

Additionally, the integral of the target distributions is calculated from the samples using AHMI. Since AHMI relies on an accurate sampling of the target distribution, the AHMI integral value provides a very sensitive test of the sampling algorithm.

Table 1 Listing of the analytical form of two-dimensional test functions used for performance testing
Fig. 1 BAT default plots for the multi-modal Cauchy distribution. The plots on the upper left and lower right show the marginalized distribution for each dimension. The other two plots show the full 2D distribution, with the lower-left plot focusing on illustrating probability intervals and the upper-right one on the general shape of the 2D sample. The dashed line indicates the global mode of the sample, while the green, yellow and red colored samples are defined by the \(68.57\%\), \(99.5\%\) and \(99.7\%\) quantiles

Fig. 2 BAT default plot for the funnel distribution. The plots on the upper left and lower right show the marginalized distribution for each dimension. The other two plots show the full 2D distribution, with the lower-left plot focusing on illustrating probability intervals and the upper-right one on the general shape of the 2D sample. The dashed line indicates the mode of the sample, while the green, yellow and red colored areas represent the \(68.57\%\), \(99.5\%\) and \(99.7\%\) intervals of the sample, respectively. Both distributions are normalized to unity

Fig. 3 Pull plot of the difference between the analytical function (red curve) and the distribution of the samples (blue histogram) for the multi-modal Cauchy (left) and funnel distribution (right)

Table 2 Performance test results for two-dimensional functions
Table 3 Non-default MCMC settings used for tests in n-dimensions
Fig. 4 Integral calculated using AHMI for the normal, multi-modal Cauchy and funnel distributions between 2 and 20 dimensions. The colored areas represent the uncertainty provided by AHMI

Fig. 5 KS test p values calculated for each marginal of the normal, multi-modal Cauchy and funnel distributions between 2 and 20 dimensions. The horizontal axis indicates the number of dimensions, while the p value is given on the vertical axis

Fig. 6 KS p values for all marginals of the sampled funnel distribution with 20 up to 35 dimensions, using both the HMC and the MH sampling algorithm. The horizontal axis indicates the number of dimensions, the p value is given on the vertical axis

As an example, Figs. 1 and 2 show the distributions of the samples generated for a multi-modal Cauchy and for the funnel distribution, respectively.

Assuming iid sampling and a large number of samples, the differences, in units of standard deviations, between the observed distributions and the true distributions are expected to follow a unit normal distribution. This should also be the case if the number of MCMC samples is large enough. For our tests, we compare the observed and expected number of entries in intervals (bins) of the function arguments. The standard deviation is estimated for each bin as the square root of the expected number of entries from the test function. For each bin with an expectation larger than ten, the difference between the observed and expected number of entries is divided by that standard deviation. A histogram of these values, also referred to as a pull plot, can be seen in Fig. 3. It is compatible with expectations.

Table 2 summarizes the expected and observed mean values, variances and global modes for the different two-dimensional test functions, together with the corresponding KS test p values and AHMI integral values. Very good agreement is observed for all distributions, with maximal deviations of \(4\%\) in the mode and \(4\%\) in the variance. The AHMI integrals are all very close to the true values, and are typically within the reported uncertainty. The smallest KS test p value is 0.093. We note that the ESS defined in “Effective Sample Size” is used in calculating the p value for the KS test. For the Cauchy distributions, the p values close to 1 indicate that the ESS values may be underestimated. We have noticed that this can occur when the samples become highly correlated.

For the two-dimensional test cases, we use 8 Markov chains with \(10^6\) steps per chain. For the n-dimensional test cases, we use 4 Markov chains and some additional non-default settings, detailed in Table 3. Note that while we relax the Brooks–Gelman convergence threshold in the test suite, to be able to probe some higher-dimensional cases more easily, we use more sensitive tests to actually verify correct sampling (see above). BAT v2.0.0 defaults were used for all other settings (such as hyper-parameters). For the sampling tests, we do not use BAT's space transformation capabilities, so that the tests measure the performance of the samplers on the actual target distributions.

The AHMI integral and KS test p values are calculated for the test functions from 2 up to 20 dimensions. Figure 4 shows the integral values and their uncertainties. The integrals of the multi-modal Cauchy and funnel distributions are calculated with AHMI for up to 10 and 16 dimensions, respectively, whereas the integral of the normal distribution is calculated for up to 20 dimensions. In all cases where the AHMI algorithm is able to report an integral value, the result is compatible within the quoted uncertainty with the expected value. The distribution of the KS test p values for the test functions from 2 to 20 dimensions is shown in Fig. 5. The distributions of the p values for the normal and funnel distributions are compatible with the expectation. The p values for the Cauchy distribution are, similar to the two-dimensional performance measures, closer to one due to the higher correlation of the samples. We have also executed the test suite for the HMC sampling algorithm; here, we present results for the funnel distribution in 20 up to 35 dimensions.

The KS p values, shown in Fig. 6, follow an approximately flat distribution between 0 and 1, indicating that both sampling algorithms perform well.

An Extended Example

In the following, we demonstrate the potential of BAT.jl by solving a realistic problem of a type often encountered in particle and astroparticle physics experiments, namely, fitting a model to a set of data and determining if a specific signal process is present in the data.

In the example, we imagine that we are searching for a rare phenomenon: e.g., a particular nuclear decay, which leaves a specific and well-defined signature in the experiment. The experiment itself comprises several different detectors that can measure the energy of an event and which are all sensitive to the signal in a limited energy window, for example from 0 to 200 keV. The data we collect will come from two different sources, signal and background. We assume that while shielding measures are present that limit the detection of background events, we are not able to suppress them completely.

To claim a discovery of the signal in this kind of experiment, it is not sufficient to detect events close to the energies predicted by the theory of the desired signal. Instead, the task is to make a statement on the probability of having detected signal events in the presence of background. This implies the comparison of two different models, namely the background-only (BKG) model, where we assume that no signal is present in the data and all events are due to background sources, and the signal-plus-background (S+BKG) model, where we assume that we detected events from both sources. We fit both of these models to the data and then compare them using a Bayes Factor.

Data Model

As the experimental observable is a set of energy values \({\varvec{E}}\), we formulate the model for the signal and background processes in terms of this quantity. We assume that the probability distribution for background events follows an exponential function characterized by a decay constant \(\lambda\), that is

$$\begin{aligned} p_{B}(E | \lambda ) = \lambda \mathrm {e}^{-\lambda \cdot E} \, . \end{aligned}$$
(20)

The probability distribution for signal events follows a normal distribution with known mean value \(\mu _S\) and also known standard deviation \(\sigma _S\), that is

$$\begin{aligned} p_{S}(E | \mu _S, \sigma _S) = \frac{1}{\sqrt{2 \pi } \cdot \sigma _S } \mathrm {e}^{-\frac{1}{2}\left( \frac{E - \mu _S}{\sigma _S} \right) ^2} \, . \end{aligned}$$
(21)

In the current example, we chose \(\mu _S = 100 \, \text {keV}\) and \(\sigma _S = 2.5 \, \text {keV}\). Each detector will operate for a finite amount of time, \(T_{i}\), also referred to as exposure. The total number of expected background events for detector i is then

$$\begin{aligned} \mu ^B_{i} = T_{i} \cdot B_{i} \, , \end{aligned}$$
(22)

where \(B_{i}\) [\({\text {counts}}/{\text {year}}\)] is the background rate, i.e. the number of background events per year of operation, assuming that the rate of background events does not change.

Similarly, we can estimate the expected number of signal events in detector i as

$$\begin{aligned} \mu ^S_{i} = T_{i} \cdot \epsilon _{i} \cdot S \, , \end{aligned}$$
(23)

where \(\epsilon _{i}\) is the efficiency of the detector to identify a signal event and S [\({\text {events}}/{\text {year}}\)] is the signal rate, representative of the signal strength.

Apart from modeling the data collected in the experiment, we might also need to model the detectors themselves. Suppose we use a total of five detectors in our experiment but, given that it takes time to build them, we start operating the detectors at different times, resulting in different exposures. Since the detectors are produced one at a time and the manufacturer has time to refine the production process, the detection efficiency might be better for detectors produced at a later stage. We can also assume that the background rates will not be exactly the same but will be close to each other, since this quantity mostly depends on the properties of the detector material and the production process. In order to account for the correlation between the background rates of the different detectors, we assume that the individual background rates \(B_{i}\) are randomly distributed according to a log-normal distribution, that is

$$\begin{aligned} p(B_{i}) \sim {\text {log-normal}} \left( \mu _B, \sigma _B \right) \, . \end{aligned}$$
(24)

The log-normal distribution is a commonly used prior for non-negative parameters.

Since \(p(B_{i})\) depends on \(\mu _B\) and \(\sigma _B\), our prior has a hierarchical, i.e., layered structure. BAT.jl allows the user to express hierarchical priors in a straightforward fashion: the prior distribution of some model parameters can be expressed as a function of other model parameters.

Given the parameters \(\mu _{B}\) and \(\sigma _{B}\), the mean of the log-normal distribution is \(m_{B} = e^{\mu _{B} + \frac{\sigma _{B}^2}{2}}\). In the following, we find it more intuitive to work with \(m_{B}\) and then set \(\mu _{B} = f(m_{B}, \sigma _{B}) = \log (m_{B}) - \frac{\sigma _{B}^2}{2}\). In our example, we assume five detectors with the exposures and efficiencies given in Table 4.
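The reparameterization can be checked directly with Distributions.jl; the following sketch (with illustrative numbers) verifies that the resulting log-normal has mean \(m_B\) and draws one background rate per detector:

```julia
using Distributions, Statistics

# Given a desired mean m_B and shape parameter σ_B, the log-normal location
# parameter is μ_B = log(m_B) − σ_B²/2, so that mean(LogNormal(μ_B, σ_B)) = m_B.
m_B, σ_B = 4.7, 0.5          # illustrative values
μ_B = log(m_B) - σ_B^2 / 2

d = LogNormal(μ_B, σ_B)
@assert mean(d) ≈ m_B

# Individual detector background rates are drawn conditionally on (m_B, σ_B):
B = rand(d, 5)               # one rate per detector (five detectors here)
```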

Table 4 Exposure and efficiency of the fictional detectors

Statistical Model

Since we have five detectors with different exposures and detection efficiencies, we split our data into five different data sets \({\mathcal {D}}_i\). The likelihood for the S+BKG model and a single data set is then

$$\begin{aligned} {\mathcal {L}}_i({\mathcal {D}}_i | S, B_i) = \prod _{j = 1}^{N^{{\text {obs}}}_i} \frac{1}{\mu ^B_i + \mu ^S_i} \left[ \mu ^B_i \, \lambda \mathrm {e}^{-\lambda E_j} + \mu ^S_i \, \frac{1}{\sigma _S \sqrt{2 \pi }} \mathrm {e}^{-\frac{1}{2}\left( \frac{E_j - \mu _S}{\sigma _S} \right) ^2} \right] \, , \end{aligned}$$
(25)

where \(N^{{\text {obs}}}_i\) is the number of events in data set \({\mathcal {D}}_i\). The total likelihood is constructed as the product of all \({\mathcal {L}}_i\) weighted with the Poisson terms [49]:

$$\begin{aligned} {\mathcal {L}}(\left\{ {\mathcal {D}}_{i} \right\} | S, \left\{ B_{i} \right\} ) = \prod _{i = 1}^5 \left[ \frac{\mathrm {e}^{-(\mu ^B_i + \mu ^S_i)}\left( \mu ^B_i + \mu ^S_i \right) ^{N_i^{{\text {obs}}}}}{N_i^{{\text {obs}}} !} \cdot {\mathcal {L}}_i({\mathcal {D}}_i | S, B_i) \right] \, . \end{aligned}$$
(26)

We use the same likelihood for the BKG model, but with all \(\mu ^S_i\) set to zero.
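A sketch of how Eqs. (25) and (26) can be written as a Julia log-likelihood is shown below. Variable names and structure are illustrative only and do not reproduce the exact code used for this example; \({\texttt {E}}\) holds the measured energies of one data set, and \({\texttt {T}}\) and \({\texttt {ϵ}}\) are the exposure and efficiency of the corresponding detector:

```julia
using Distributions

# Log-likelihood of one data set (Eqs. 22, 23, 25 and the Poisson term of Eq. 26).
function loglik_single(E, T, ϵ, S, B, λ, μ_S, σ_S)
    μB = T * B           # expected background counts, Eq. (22)
    μS = T * ϵ * S       # expected signal counts, Eq. (23)
    μ  = μB + μS
    ll = logpdf(Poisson(μ), length(E))              # extended (Poisson) term
    for Ej in E                                      # event-level mixture, Eq. (25)
        ll += log((μB * pdf(Exponential(1 / λ), Ej) +
                   μS * pdf(Normal(μ_S, σ_S), Ej)) / μ)
    end
    return ll
end

# Total log-likelihood: sum over the five data sets, Eq. (26).
loglik_total(data, Ts, ϵs, S, Bs, λ, μ_S, σ_S) =
    sum(loglik_single(data[i], Ts[i], ϵs[i], S, Bs[i], λ, μ_S, σ_S)
        for i in eachindex(data))
```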

Apart from the likelihood, we also specify the priors for the free parameters of the model. These are the signal rate \(S \sim {\text {Uniform}}(0, 10)\) \({\text {year}}^{-1}\), the background rate parameters \(m_B \sim {\text {Uniform}}(0, 5 \cdot 10^{-2})\) \({\text {counts}} / {\text {year}}\) and \(\sigma _B \sim {\text {Uniform}}(0.1, 1.0)\) \({\text {counts}} / {\text {year}}\) as well as the decay constant \(\lambda \sim {\text {Uniform}}(0, 100)\).

Fig. 7 Binned generated data

Data and Results

The data for the analysis are generated synthetically. We choose a decay constant of \(\lambda _{{\text {true}}} = 50\) with background rate parameters \(m_B = 4.7\) and \(\sigma _B = 0.5\). In addition, we include three signal events, i.e., \(S=0.9375\). The exact data used in this example are available in Tables 5 and 6 in Appendix “Example Data”, and Fig. 7 shows the binned histogram of these data. As can be seen, without knowing that there are three signal events at \(100\,\text {keV}\), it would be very difficult to recognise them just by looking at the data.

Fig. 8 Posterior distribution of the signal rate S and the background decay constant \(\lambda\). The blue line (in the upper-left and lower-right plot) shows the prior

Fig. 9 Posterior distribution of the background rate mean \(m_B\) and its sigma \(\sigma _B\). The blue line (in the upper-left and lower-right plot) shows the prior

Fig. 10 Distribution of all data events, compared with the S+BKG model using the best-fit parameters and central quantiles according to the full posterior distribution

In the following, BAT.jl v2.0.0 default settings (e.g., hyper-parameters) are used, unless noted otherwise.

We sample the posterior densities of both models using HMC with 4 MCMC chains and \(10^5\) steps per chain. BAT.jl’s automatic internal parameter space transformation allows us to do this even though the prior is hierarchical and contains bounded distributions.

To determine whether the model with or without signal should be preferred, we also compute the evidences of the BKG and S+BKG models by applying the AHMI algorithm to the posterior samples. We then calculate the Bayes factor under the assumption of the same prior probability for the two models, that is

$$\begin{aligned} {\text {BF}} = \frac{p({\text {S+BKG}} | {\mathcal {D}})}{p({\text {BKG}} | {\mathcal {D}})} = 3.4 \, , \end{aligned}$$
(27)

which supports the claim that the data contains both signal and background events.

Having determined that our data does indeed contain a signal, we look at the marginal posterior distributions in order to check how well the fit reconstructs the parameters used to generate the synthetic data. With BAT.jl, the user can easily plot the results as in Fig. 8, which shows both the 1D and 2D marginal posterior distributions for the signal rate S and the background decay constant \(\lambda\). In the marginalized distribution of a parameter, the mode is representative of the most likely scenario; inspecting the modes in Fig. 8, we notice that S peaks at 0.94 while \(\lambda\) peaks at 47. Both modes are very close to the nominal values that were used in the data generation.

Since we assume a correlation between the background rates of the individual detectors, we examine the posterior of the model parameters that control the distribution of the \(B_i\) in Fig. 9. We notice that a mean background rate \(m_B\) between 6 and 7 events per year is most likely. The spread of the posterior log-normal distribution is likely to be small, since the posterior of the parameter \(\sigma _B\) exhibits an exponentially decaying shape peaking at 0.

Finally, we compare our S+BKG model with the data in Fig. 10.

Summary and Outlook

We have developed a platform-independent software package for Bayesian inference, BAT.jl. BAT.jl features a toolbox of numerical algorithms for calculations often encountered in Bayesian inference, in particular sampling, optimization and integration algorithms, as well as flexible input/output routines. BAT.jl also allows for interfacing with arbitrary custom code, e.g., for the evaluation of complex models. We use the Julia programming language to provide a lightweight but powerful interface, parallel processing and automatic differentiation. We intend for the package to appeal to a wide user base, not constrained to a specific realm of science. The main application of BAT.jl is the study of models that are characterized by (numerically) complex likelihood functions. In this paper, we describe the design choices, the implemented algorithms, and the procedure used to test the implementation. We also give a concrete physics example that demonstrates the capabilities of BAT.jl. BAT.jl has already seen first use in several scientific works [50,51,52,53,54,55].

For the future, we plan to extend the functionality available in BAT.jl further, adding more algorithms, novel sampling schemes and multi-level parallelization.