Efficient and generalizable tuning strategies for stochastic gradient MCMC

Stochastic gradient Markov chain Monte Carlo (SGMCMC) is a popular class of algorithms for scalable Bayesian inference. However, these algorithms include hyperparameters such as step size or batch size that influence the accuracy of estimators based on the obtained posterior samples. As a result, these hyperparameters must be tuned by the practitioner and currently no principled and automated way to tune them exists. Standard Markov chain Monte Carlo tuning methods based on acceptance rates cannot be used for SGMCMC, thus requiring alternative tools and diagnostics. We propose a novel bandit-based algorithm that tunes the SGMCMC hyperparameters by minimizing the Stein discrepancy between the true posterior and its Monte Carlo approximation. We provide theoretical results supporting this approach and assess various Stein-based discrepancies. We support our results with experiments on both simulated and real datasets, and find that this method is practical for a wide range of applications.


INTRODUCTION
Most MCMC algorithms contain user-controlled hyperparameters which need to be carefully selected to ensure that the MCMC algorithm explores the posterior distribution efficiently. Optimal tuning rates for many popular MCMC algorithms, such as the random-walk (Gelman et al., 1997) or Metropolis-adjusted Langevin algorithms (Roberts and Rosenthal, 1998), rely on setting the tuning parameters according to the Metropolis-Hastings acceptance rate. Using metrics such as the acceptance rate, hyperparameters can be optimized on-the-fly within the MCMC algorithm using adaptive MCMC (Andrieu and Thoms, 2008; Vihola, 2012). However, in the context of stochastic gradient MCMC (SGMCMC), there is no acceptance rate to tune against, and the trade-off between bias and variance for a fixed computational budget means that tuning approaches designed for target-invariant MCMC algorithms are not applicable.
Related work Previous adaptive SGMCMC algorithms have focused on embedding ideas from the optimization literature within the SGMCMC framework, e.g. gradient preconditioning (Li et al., 2016), RMSprop (Chen et al., 2016) and Adam (Kim et al., 2020). However, all of these algorithms still rely on hyperparameters, such as learning rates and subsample sizes, which need to be optimized. To the best of our knowledge, no principled approach has been developed to optimize the SGMCMC hyperparameters. In practice, users often take a trial-and-error approach: they run multiple short chains with different hyperparameter configurations and select the setting which minimizes a metric of choice, such as the kernel Stein discrepancy (Nemeth and Fearnhead, 2020) or cross-validation error (Izmailov et al., 2021). However, this laborious approach is inefficient and not guaranteed to produce the best hyperparameter configuration.
Contribution In this paper we propose a principled adaptive SGMCMC scheme that allows users to tune the hyperparameters, e.g. the step size h (also known as the learning rate) and the data subsample size n. Our approach provides an automated trade-off between bias and variance in the posterior approximation for a given computational time budget. Our adaptive scheme uses a multi-armed bandit algorithm to select the SGMCMC hyperparameters which minimize the Stein discrepancy between the approximate and true posterior distributions. The approach only requires a user-defined computational budget as well as the gradients of the log-posterior, which are already available via the stochastic gradient MCMC algorithm. A second contribution of this paper is a rigorous assessment of existing tuning methods for SGMCMC, which to our knowledge is not present in the literature.

BACKGROUND

Stochastic Gradient Langevin Algorithm
We are interested in sampling from a target density π(θ), where for parameters of interest θ ∈ R^d the unnormalized density is of the form π(θ) ∝ exp{−U(θ)}. We assume that the potential function U(θ) is continuous and differentiable almost everywhere. If we have independent data y_1, ..., y_N, then π(θ) ∝ p(θ) ∏_{i=1}^N f(y_i | θ) is the posterior density, where p(θ) is the prior density and f(y_i | θ) is the likelihood for the ith observation. In this setting, we can define U(θ) = ∑_{i=1}^N U_i(θ), where U_i(θ) = −log f(y_i | θ) − (1/N) log p(θ). We can sample from π(θ) by simulating a stochastic process that has π as its stationary distribution. Under mild regularity conditions, the Langevin diffusion (Roberts and Tweedie, 1996; Pillai et al., 2012) has π as its stationary distribution. In practice, however, it is not possible to simulate the Langevin diffusion exactly in continuous time, and instead we sample from a discretized version. That is, for a small time interval h > 0, the Langevin diffusion has approximate dynamics given by

θ_{k+1} = θ_k − h∇U(θ_k) + √(2h) ξ_k,   (1)

where ξ_k is a vector of d independent standard Gaussian random variables. In the large-data setting, we replace ∇U(θ) with an unbiased estimate ∇Ũ(θ) = (N/n) ∑_{i∈S_n} ∇U_i(θ), calculated using a subsample of the data of size n ≪ N, where S_n is a random sample, drawn without replacement, from {1, ..., N}. This algorithm is known as stochastic gradient Langevin dynamics (SGLD; Welling and Teh, 2011).
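To make the update concrete, here is a minimal SGLD sketch in plain NumPy (our own illustration, not the paper's implementation), targeting a standard Gaussian toy posterior for which ∇U(θ) = θ:

```python
import numpy as np

def sgld_step(theta, grad_u_hat, h, rng):
    # One SGLD update: theta_{k+1} = theta_k - h * grad_U(theta_k) + sqrt(2h) * xi_k
    xi = rng.standard_normal(theta.shape)
    return theta - h * grad_u_hat(theta) + np.sqrt(2.0 * h) * xi

# Toy target: pi = N(0, I_2), so U(theta) = ||theta||^2 / 2 and grad U(theta) = theta.
rng = np.random.default_rng(0)
theta = np.zeros(2)
samples = []
for _ in range(5000):
    theta = sgld_step(theta, lambda t: t, h=0.05, rng=rng)
    samples.append(theta.copy())
samples = np.array(samples)
```

For small h the chain approximates the Langevin diffusion; the discretization introduces a bias that shrinks with h, which is exactly the bias/variance trade-off that the tuning procedure in this paper balances.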
In this paper we present our adaptive stochastic gradient MCMC scheme in the context of the SGLD algorithm for simplicity of exposition. However, our proposed approach is readily generalizable to all other stochastic gradient MCMC methods, e.g. stochastic gradient Hamiltonian Monte Carlo (Chen et al., 2014). Details of the general class of stochastic gradient MCMC methods, presented under the complete recipe framework, are given in Ma et al. (2015). See Section D of the Supplementary Material for a summary of the SGMCMC algorithms used in this paper.

Stein Discrepancy
We define π̂ as the empirical distribution generated by the stochastic gradient MCMC algorithm (1). We can measure how well this distribution approximates our target distribution of interest, π, by defining a discrepancy metric between the two distributions. Following Gorham and Mackey (2015) we consider the Stein discrepancy

S(π̂, F, π) = sup_{φ∈F} | E_{π̂}[A_π φ(θ)] |,   (2)

where φ : R^d → R^d is any smooth function in the Stein set F and A_π is the Stein operator, which satisfies Stein's identity E_π[A_π φ(θ)] = 0 for all φ ∈ F. Interested readers can find technical details on the corresponding choice of F in Gorham and Mackey (2017).

Kernel Stein Discrepancy
When F is chosen to be the unit ball of a reproducing kernel Hilbert space with kernel k, the supremum in (2) admits a closed form, the kernel Stein discrepancy (KSD),

KSD(π̂, π) = √( E_{θ,θ'∼π̂}[ k_π(θ, θ') ] ),   (3)

where k_π is the Stein kernel constructed from k and ∇ log π. The kernel k must be positive definite, which is a condition satisfied by most popular kernels, including the Gaussian and Matérn kernels. Gorham and Mackey (2017) recommend using the inverse multi-quadric kernel, k(θ, θ') = (c² + ||θ − θ'||²₂)^β, which they prove detects non-convergence when c > 0 and β ∈ (−1, 0).

Finite Set Stein Discrepancy
KSD is a natural discrepancy measure for stochastic gradient MCMC algorithms as π(θ) is only required up to a normalization constant and the gradients of the log-posterior density are readily available. The drawback of KSD is that its computational cost is quadratic in the number of samples. Linear versions of the KSD (Liu et al., 2016) are an order of magnitude faster, but the computational advantage is outweighed by a significant decrease in the accuracy of the Stein estimator. Jitkrittum et al. (2017) propose a linear-time Stein discrepancy, the Finite Set Stein Discrepancy (FSSD), which utilizes the Stein witness function g(θ') = E_{θ∼π̂}[A_π k(θ, θ')]. The function g can be thought of as witnessing the differences between π̂ and π, where a discrepancy in the region around θ is indicated by large |g(θ)|. The Stein discrepancy is then essentially measured via the flatness of g, where the measure of flatness can be computed in linear time. The key to FSSD is to use real analytic kernels k, e.g. the Gaussian kernel, which results in g_1, ..., g_d also having a real analytic form. If g_i ≠ 0, then almost surely g_i(v_1), ..., g_i(v_J) are not all zero for a finite set of test locations V = {v_1, ..., v_J}. Under the same assumptions as KSD, FSSD is defined as

FSSD²(π̂, π) = (1/(dJ)) ∑_{j=1}^J ||g(v_j)||²₂.   (4)

Theorem 1 of Jitkrittum et al. (2017) guarantees that FSSD² = 0 if and only if π̂ = π for any choice of test locations {v_j}_{j=1}^J. However, some test locations will result in improved test power for finite samples, and so, following Jitkrittum et al. (2017), we optimize the test locations by first sampling them from a Gaussian fit to the posterior samples and then using gradient ascent to maximize the FSSD.
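To make the KSD in (3) concrete, the following NumPy sketch computes the V-statistic KSD estimator with the inverse multi-quadric kernel; the Stein kernel expansion below is the standard closed form, and all function names are our own:

```python
import numpy as np

def imq_stein_kernel(x, y, sx, sy, c=1.0, beta=-0.5):
    # Stein kernel k_pi(x, y) built from the IMQ kernel k(x, y) = (c^2 + ||x - y||^2)^beta,
    # where sx = grad log pi(x) and sy = grad log pi(y).
    d = x.shape[0]
    diff = x - y
    r2 = diff @ diff
    base = c**2 + r2
    k = base**beta
    gkx = 2.0 * beta * base**(beta - 1) * diff       # grad_x k(x, y)
    gky = -gkx                                       # grad_y k(x, y)
    trace = (-4.0 * beta * (beta - 1) * base**(beta - 2) * r2
             - 2.0 * d * beta * base**(beta - 1))    # trace(grad_x grad_y k)
    return (sx @ sy) * k + sx @ gky + sy @ gkx + trace

def ksd(samples, score_fn, c=1.0, beta=-0.5):
    # V-statistic estimate: KSD^2 = (1/P^2) * sum_{i,j} k_pi(theta_i, theta_j).
    scores = [score_fn(x) for x in samples]
    total = sum(imq_stein_kernel(xi, xj, si, sj, c, beta)
                for xi, si in zip(samples, scores)
                for xj, sj in zip(samples, scores))
    return float(np.sqrt(max(total, 0.0)) / len(samples))
```

With the score −θ of a standard Gaussian target, samples drawn from that Gaussian yield a much smaller KSD than the same samples shifted away from the target, illustrating why the KSD is a usable reward signal. The double loop makes the quadratic cost discussed above explicit.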

HYPERPARAMETER LEARNING
In this section we introduce an automated and generally applicable approach to learning the user-controlled parameters of a stochastic gradient MCMC algorithm, which throughout we refer to as hyperparameters. For example, in the case of SGLD these are the step size h and the batch size n; in the case of stochastic gradient Hamiltonian Monte Carlo they also include the number of leapfrog steps. Our adaptive scheme relies on multi-armed bandits (Slivkins, 2019) to identify the optimal setting of the hyperparameters such that, for a given time budget, the selected parameters minimize the Stein discrepancy, and therefore maximize the accuracy of the posterior approximation. Our proposed approach, the Multi-Armed MCMC Bandit Algorithm (MAMBA), works by sequentially identifying and pruning, i.e. removing, poor hyperparameter configurations in a principled, automatic and online manner to speed up hyperparameter learning. The MAMBA algorithm can be used within any stochastic gradient MCMC algorithm and only requires the user to specify the training budget and the number of hyperparameter sets.

Multi-Armed Bandits with Successive Halving
Multi-armed bandits are a class of algorithms for sequential decision-making that iteratively select actions from a set of possible decisions. These algorithms can be split into two categories: 1) best-arm identification, in which the goal is to identify the action with the highest average reward, and 2) exploration vs. exploitation, where the goal is to maximize the cumulative reward over time (Bubeck and Cesa-Bianchi, 2012). In the best-arm identification setting, an action, aka an arm, is selected and produces a reward, where the reward is drawn from a fixed probability distribution corresponding to the chosen arm. At the end of the exploration phase, a single arm is chosen which maximizes the expected reward. This differs from the typical multi-armed bandit setting, where the strategy for selecting arms is based on minimizing cumulative regret (Lattimore and Szepesvári, 2020).
The successive halving algorithm (Karnin et al., 2013; Jamieson and Talwalkar, 2016) is a multi-armed bandit algorithm based on best-arm identification. Successive halving learns the best hyperparameter setting, i.e. the best arm, using a principled early-stopping criterion to identify the best arm within a set level of confidence, or for a fixed computational budget. In this paper, we consider the fixed computational budget setting, where the algorithm proceeds as follows: 1) uniformly allocate a computational budget to a set of arms, 2) evaluate the performance of all arms against a chosen metric, 3) promote the best 1/η of arms to the next stage, where typically η = 2 or 3, and prune the remaining arms from the set. The process is repeated until only one arm remains. As the total computational budget is fixed, pruning the least promising arms allows the algorithm to allocate exponentially more computational resource to the most promising hyperparameter sets.

Tuning Stochastic Gradients with a Multi-Armed MCMC Bandit Algorithm (MAMBA)

We describe our proposed algorithm, MAMBA, to tune the hyperparameters of a generic stochastic gradient MCMC algorithm. For ease of exposition, we present MAMBA in the context of the SGLD algorithm (1), where a user tunes the step size h and the batch size n. Details on other SGMCMC algorithms can be found in Appendix D. We present MAMBA as the following three-stage process:

Initialize In our multi-armed bandit setting we assume M possible stochastic gradient MCMC hyperparameter configurations, which we refer to as arms. Each arm s in the initial set S_0 = {1, ..., M} represents a hyperparameter tuple φ_s = (h_s, n_s). The hyperparameters in the initial set are chosen from a uniform grid.
Evaluate and Prune At each iteration of MAMBA, i = 0, 1, ..., each arm s is selected from the set S_i and the sth SGLD algorithm is run for r_i seconds using the hyperparameter configuration φ_s. Each arm is associated with a reward ν_s that measures the quality of the posterior approximation. We use the negative Stein discrepancy as the reward function that we aim to maximize. Specifically, we calculate the Stein discrepancy from the SGMCMC output using the KSD (3) or FSSD (4), i.e. ν_s = −KSD(π̂_s, π) or ν_s = −FSSD(π̂_s, π). Without loss of generality, we can order the set of arms S_i by their rewards, i.e. ν_1 ≥ ν_2 ≥ ... ≥ ν_M, where ν_1 is the arm with the optimal reward at each iteration of MAMBA. The top 100/η% of arms in S_i with the highest rewards are retained to produce the set S_{i+1}. The remaining arms are pruned from the set and are not evaluated again at future iterations.
Reallocate time Computation time allocated to the pruned samplers is reallocated to the remaining samplers, r_{i+1} = ηr_i. As a result, by iteration i, each of the remaining SGLD samplers has run for a total time budget of R = r_0 + ηr_0 + η²r_0 + ... + η^{i−1}r_0 seconds, where r_0 is the time budget for the first MAMBA iteration. This process is repeated for a total of ⌊log_η M⌋ MAMBA iterations. We use a base-η logarithm as we divide the number of arms by η at every iteration, and we use the floor function for the cases where the initial number of arms M is not a power of η. The MAMBA algorithm is summarized in Algorithm 1.
Algorithm 1: MAMBA
Input: initial number of configurations M, total time budget T, and pruning rate η (default η = 3).
Sample M hyperparameter configurations and store them in the set S_0.
for i = 0, 1, ..., ⌊log_η M⌋ − 1:
    Run each SGLD sampler in S_i using (1) for a time budget of r_i seconds.
    Calculate the KSD or FSSD for each sampler using (3) or (4), respectively.
    Let S_{i+1} be the set of ⌊|S_i|/η⌋ samplers with the lowest KSD/FSSD.
return the single remaining arm.
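A minimal, self-contained sketch of this loop is given below. To keep it runnable we use a one-dimensional Gaussian toy target, measure the budget in iterations rather than seconds, and replace the KSD/FSSD reward with a cheap moment-matching proxy; `run_sgld` and `mamba` are our own illustrative names, not the paper's implementation:

```python
import numpy as np

def run_sgld(h, n_iters, rng):
    # Toy SGLD chain targeting N(0, 1): grad U(theta) = theta.
    theta, out = 0.0, []
    for _ in range(n_iters):
        theta = theta - h * theta + np.sqrt(2.0 * h) * rng.standard_normal()
        out.append(theta)
    return np.array(out)

def mamba(step_sizes, r0=500, eta=2, seed=0):
    # Successive halving over step sizes (arms). The paper's default is eta = 3;
    # the reward here is -|sample variance - 1|, standing in for the negative KSD.
    rng = np.random.default_rng(seed)
    arms, budget = list(step_sizes), r0
    while len(arms) > 1:
        rewards = [-abs(run_sgld(h, budget, rng).var() - 1.0) for h in arms]
        keep = max(1, len(arms) // eta)
        best_first = np.argsort(rewards)[::-1]
        arms = [arms[i] for i in best_first[:keep]]  # prune the worst arms
        budget *= eta                                # reallocate time to survivors
    return arms[0]

best_h = mamba([1.9, 0.3, 0.05, 1e-4])
```

The overly large step size (high discretization bias) and the tiny one (slow mixing within the budget) are pruned in the first round, leaving the budget to be spent on the competitive arms.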
Guarantees It is possible that MAMBA will eliminate the optimal hyperparameter set during one of the arm-pruning phases. Through examination of the 1 − 1/(2η) quantile, we show in Theorem 1 that we can bound the probability that MAMBA, using the negative KSD as the reward function, will incorrectly prune the best hyperparameter configuration, and we provide a bound on the maximum computation budget required for MAMBA to identify the optimal hyperparameter setting.
Theorem 1 (i) MAMBA correctly identifies the best hyperparameter configuration for a stochastic gradient MCMC algorithm with high probability; (ii) for a probability of at least 1 − δ that MAMBA will successfully identify the optimal hyperparameter set, MAMBA requires a computational budget that grows with the largest KSD variance across the arms (the explicit bounds are given in Appendix A). A proof of Theorem 1 is given in Appendix A and builds on the existing work of Karnin et al. (2013) for fixed-budget best-arm identification bandits. Theorem 1 highlights the contribution of the KSD variance in identifying the optimal arm. In particular, the total computation budget depends on the arm with the largest KSD variance.

Practical Guidance for Using MAMBA
Time budgets In the case of sampling algorithms like SGMCMC, the relevant budgets are total computation time or the number of iterations. If we view sampling algorithms as a trade-off between statistical accuracy and computation time, then the goal is to identify the hyperparameters that produce the best Monte Carlo estimates under a given time constraint. Using a time budget, rather than a number of iterations, means we can choose the best data subsample size in a principled way, as a smaller subsample will lead to a faster algorithm but increase the Monte Carlo error. Additionally, using a time budget to select the optimal hyperparameters ties the statistical efficiency to the available computational power and implementation, e.g. programming language, hardware, etc.
Estimating KSD/FSSD Calculating the KSD or FSSD using (3) or (4) requires the gradients of the log-posterior and the SGMCMC samples. Typically, one would calculate the KSD/FSSD using fullbatch gradients (i.e. using the entire dataset) on the full chain of samples. However, as we only use SGMCMC when the dataset is large, this would be a computationally expensive approach. Two natural solutions are to i) use stochastic gradients (Gorham et al., 2020), calculated through subsampling, or ii) use a thinned chain of samples. We investigate both options in terms of KSD/FSSD accuracy in Appendix B.2. We find that using the stochastic KSD/FSSD produces results similar to the fullbatch KSD/FSSD. However, calculating the KSD/FSSD for a large number of high-dimensional samples is computationally expensive, so for our experimental study in Section 4 we use fullbatch gradients with thinned samples. This leads to lower-variance KSD/FSSD estimates at a reasonable computational cost. Note that fullbatch gradients are only used for MAMBA iterations and not SGMCMC iterations. We find that this does not significantly increase the overall computational cost, as for each iteration of MAMBA there are thousands of SGMCMC iterations.
Alternative metrics Stein-based discrepancies are a natural metric for assessing the quality of the posterior approximation, as they only require the SGMCMC samples and log-posterior gradients. However, alternative metrics to tune SGMCMC can readily be applied within the MAMBA framework. For example, there is currently significant interest in understanding uncertainty in neural networks via metrics such as the expected calibration error (ECE), the maximum calibration error (MCE) (Guo et al., 2017), and out-of-distribution (OOD) tests (Lakshminarayanan et al., 2017). These metrics have the advantage that they are more scalable to very high-dimensional problems than the KSD (Gong et al., 2020). As a result, although the KSD is a sensible choice when aiming for posterior accuracy, alternative metrics may be more appropriate for some problems, for example, very high-dimensional deep neural networks.

EXPERIMENTAL STUDY
In this section we illustrate MAMBA on three different models. We optimize the hyperparameters using MAMBA (Alg. 1) and compare against a grid search approach and a heuristic method. The initial arms in MAMBA are set as an equally spaced grid over batch sizes and step sizes (and the number of leapfrog steps for SGHMC). The heuristic method fixes the step size to be inversely proportional to the dataset size, i.e. h = 1/N (Brosse et al., 2018). For both the grid search and heuristic approaches, we use a 10% batch size throughout. Note that MAMBA is the only method able to estimate both the step size and the batch size. Full details of the experiments can be found in Appendix C. Code to replicate the experiments can be found at https://github.com/RedactedForReview. All experiments were carried out on a laptop CPU (MacBook Pro 1.4 GHz Quad-Core Intel Core i5). For each example, the figures show results over a short number of tuning iterations and the tables give results for longer runs.
For MAMBA, we set R = 1 sec (i.e. the running time of the longest sampler). We point out that this time budget is small compared to what would be used by most practitioners. However, this example illustrates the MAMBA methodology and compares it against a full MCMC algorithm which provides us with "ground-truth" posterior samples. To calculate the KSD/FSSD efficiently, we thin the samples and use fullbatch gradients.
In Figure 1, we plot the KSD calculated from the posterior samples for each of the tuning methods. We calculated the KSD curves for ten independent runs and plotted the mean curve along with a confidence interval (two standard deviations). The optimal hyperparameters given by each method can be found in Table 4 of Appendix C.1. Our results from Figure 1 show that optimizing the hyperparameters with MAMBA, using either KSD or FSSD, produces samples that have the lowest KSD for all but one of the six samplers. For the SGNHT sampler, the heuristic approach gives the lowest KSD; however, as shown in Table 4 in Appendix C.1, MAMBA-FSSD finds an optimal step size of h = N^{−1}, which coincides with the step size given by the heuristic approach. Therefore, the difference in KSD between these two methods is a result of the batch size: when taking into account computation time, MAMBA-FSSD finds a 1% batch size to be optimal, whereas the heuristic method does not learn the batch size, which is fixed at 10%. Ignoring computation time, a larger batch size is expected to produce a better posterior approximation. However, it is interesting to note that for the five out of six samplers where MAMBA performs best in terms of KSD, MAMBA chooses an optimal batch size of 1%. Results are summarized in Table 1, and further results, including predictive accuracy on a test dataset and the number of samples obtained within the time budget, are given in Table 5 of Appendix C.1. We find that the MAMBA-optimized samplers perform the best in terms of KSD. As a result, the Monte Carlo estimates of the posterior standard deviations generally perform well. We tested each sampler by running it for 10 seconds.

Probabilistic Matrix Factorization
We consider the probabilistic matrix factorization model (Salakhutdinov and Mnih, 2008) on the MovieLens dataset (Harper and Konstan, 2015), which contains 100K ratings for 1,682 movies from 943 users (see Appendix C.2.1 for model details). We optimize the hyperparameters for six samplers: SGLD, SGLD-CV, SGHMC, SGHMC-CV, SGNHT, and SGNHT-CV. We use a similar setup to that of logistic regression and tune each sampler using MAMBA with KSD, grid search, and the heuristic method. Details are given in Appendix C.2.
From Figure 2 we can see that the samplers tuned using MAMBA tend to outperform the ones tuned with the other two methods. We also test the quality of the posterior samples against NumPyro's (Phan et al., 2019; Bingham et al., 2018) implementation of NUTS (Hoffman and Gelman, 2014), which produces 10K samples with 1K samples as burn-in. This state-of-the-art sampler obtains high-quality samples but is significantly more computationally expensive, taking around six hours on a laptop CPU. We estimate the posterior standard deviations using these samples and treat them as the ground truth. We run each SGMCMC sampler for 20 seconds and estimate the standard deviation after removing the burn-in. We estimate the posterior standard deviation for each sampler and show the relative errors and KSD in Table 2 (further results are given in Table 7 in Appendix C.2). We find that MAMBA consistently identifies hyperparameters that give the lowest KSD, but that for some samplers the heuristic approach gives a lower error on the estimated standard deviation. This could be due to the random realisation of the SGMCMC chain; however, while accuracy in the standard deviation is fast to compute, as a metric it is not as useful as the KSD, which measures the quality of the full distribution and not just the accuracy of the second moment.
Bayesian Neural Network

For this example, as with the previous examples, we tune MAMBA using KSD; however, we validate the accuracy of the various tuning approaches against the expected calibration error (ECE) and maximum calibration error (MCE) plotted in Figure 3. We find that the samplers tuned using MAMBA tend to outperform the other approaches in terms of ECE.
We assess the performance of the MAMBA-optimized samplers over a longer time budget and run the samplers for 300 seconds starting from the maximum a posteriori value. We then remove the visible burn-in and calculate the ECE and MCE to compare the quality of the posterior samples. We report the results in Table 3, where ECE and MCE are reported as percentages (lower is better). Overall, the results in Table 3 show that MAMBA-optimized samplers tend to perform best, and when not the best they produce results which are very close to those of the best-performing method. For all samplers, MAMBA finds an optimal batch size of 1%, which is ten times smaller than the batch size of the other methods and therefore results in a faster and highly accurate algorithm. For SGNHT, both MAMBA and grid search found a step size that was slightly too large (log_10(h) = −4.5 and log_10(h) = −4, respectively), which caused the sampler to lose stability for longer chains. In contrast, the sampler tuned using the heuristic method is the only one that remained stable. As a result, we re-ran these two tuning methods on a grid with smaller step sizes: {−5, −5.5, −6, −6.5, −7, −7.5}. This smaller grid allowed the two tuning algorithms to find a stable step size (log_10(h) = −5 for both methods), and so this slight decrease in step size was enough to make the sampler stable. We note that there exist samplers with more stable numerical methods, such as the BADODAB sampler, which solves the same diffusion as SGNHT but with a more stable splitting method (Leimkuhler and Shang, 2016). Such samplers might be easier to tune with MAMBA or grid search.
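For reference, the ECE and MCE used above can be computed with a short binned estimator; this is the standard construction (our own sketch, not the paper's code):

```python
import numpy as np

def calibration_errors(probs, labels, n_bins=10):
    # Bin predictions by confidence; ECE is the bin-size-weighted average of
    # |accuracy - mean confidence| per bin, and MCE is the maximum such gap.
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap
            mce = max(mce, gap)
    return ece, mce
```

A well-calibrated Bayesian model predicting with 95% confidence should be correct about 95% of the time; both metrics quantify deviations from that ideal.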

DISCUSSION AND FUTURE WORK
Final remarks In this paper we have proposed a multi-armed bandit approach to estimating the hyperparameters of any stochastic gradient MCMC algorithm. Our approach optimizes the hyperparameters to produce posterior samples which accurately approximate the posterior distribution within a fixed time budget. We use Stein-based discrepancies as natural metrics to assess the quality of the posterior approximation.
The generality of the MAMBA algorithm means that alternative metrics, such as predictive accuracy, can easily be employed within MAMBA in place of a Stein-based metric. Whilst not explored in this paper, it is also possible to move beyond the stochastic gradient MCMC setting and apply MAMBA directly to standard MCMC algorithms, such as Hamiltonian Monte Carlo, to estimate their hyperparameters.
Finally, in this paper we performed a systematic study of different SGMCMC tuning methods across various models and samplers, which to our knowledge is the first rigorous comparison of these methods. While these alternative approaches can work well, they are only able to tune the step size parameter; unlike MAMBA, they do not tune the batch size or other SGMCMC hyperparameters, such as the number of leapfrog steps.
Limitations and future work A limitation of this method is that computing the KSD can be expensive when there are many posterior samples. One solution we explored in this paper is to use the FSSD as a linear-time metric. In the case of the KSD, we significantly lowered the cost by thinning the Markov chain, but the KSD remains an expensive metric to compute. The KSD also suffers from the curse of dimensionality (Gong et al., 2020), though our results show that the KSD gave good results even for our two high-dimensional problems. As a result, further work in this area should explore alternative discrepancy metrics which are scalable in both sample size and dimension. For example, scalable alternatives to the KSD, such as the sliced KSD (Gong et al., 2020), would be more appropriate for very high-dimensional problems.

Appendix A PROOF OF THEOREM 1
Lemma 1 Assume {θ_i}_{i=1}^P are P samples from π̂ and k(θ, θ') is a positive definite kernel in the Stein class of π and π̂. If E_{π̂}[k_π(θ, θ')²] < ∞ and π̂ ≠ π, then the Monte Carlo approximation of the kernel Stein discrepancy (3) satisfies a central limit theorem,

√P ( KSD̂²(π̂, π) − KSD²(π̂, π) ) → N(0, σ²) in distribution,

where σ² is the asymptotic variance of the associated V-statistic. This lemma follows directly from Sections 5.5.1 and 5.5.2 of Serfling (2009).
Lemma 2 Let s be an arm from the set of arms S_i = {1, 2, ...} at iteration i of the MAMBA algorithm. Let s = 1 be the optimal arm with expected reward ν_1, and assume that the optimal arm was not eliminated at iteration i − 1 of the MAMBA algorithm. Then for any arm s ∈ S_i with estimated reward ν̂_s,

P(ν̂_s ≥ ν_1) ≤ exp( −α_s² / (2σ_s²) ), where α_s := ν_1 − ν_s.

Proof: Using the CLT result from Lemma 1, we assume that ν̂_s is an unbiased estimate of the reward for arm s with sub-Gaussian proxy σ_s². Then by Hoeffding's inequality,

P(ν̂_s − ν_s ≥ α_s) ≤ exp( −α_s² / (2σ_s²) ),

and since α_s = ν_1 − ν_s we have P(ν̂_s − ν_s ≥ α_s) = P(ν̂_s ≥ ν_1), and the lemma follows.
Lemma 3 The probability that the best arm is eliminated at iteration i of MAMBA (Algorithm 1) is at most

(2η − 1) exp( −α_{s_i}² / (2σ²) ),

where σ² = max_{s∈S_i'} σ_s², s_i denotes the arm ranked ⌊|S_i|/(2η)⌋ by expected reward, and α_{s_i} = ν_1 − ν_{s_i}.

Proof: This result follows a similar argument to Lemma 4.3 from Karnin et al. (2013), but for a general η. If the best arm is removed at iteration i, then there must be at least |S_i|/η arms in S_i (i.e. (1/η)|S_i| = M/η^{i+1}) with empirical reward larger than that of the best arm (i.e. a KSD score lower than that of the arm with the best possible KSD score). If we let S_i' be the set of arms in S_i excluding the |S_i|/(2η) = M/(2η^{i+1}) arms with largest expected reward, then the empirical reward of at least |S_i|/η − |S_i|/(2η) = |S_i|/(2η) arms in S_i' must be greater than that of the best arm at iteration i.
Let N_i be the number of arms in S_i' with empirical reward greater than the reward of the optimal arm. By Lemma 2,

E[N_i] = ∑_{s∈S_i'} P(ν̂_s ≥ ν_1) ≤ |S_i'| exp( −α_{s_i}² / (2σ²) ),

where σ² = max_{s∈S_i'} σ_s², and the final inequality follows from the fact that every arm in S_i' has expected reward no larger than ν_{s_i}, so that α_s ≥ α_{s_i}. Applying Markov's inequality, we obtain

P( N_i ≥ |S_i|/(2η) ) ≤ 2η E[N_i] / |S_i| ≤ (2η |S_i'| / |S_i|) exp( −α_{s_i}² / (2σ²) ) = (2η − 1) exp( −α_{s_i}² / (2σ²) ).

Using Lemmas 2 and 3 we can now prove Theorem 1.
Proof: The algorithm cannot exceed the budget of T (in our case T is given in seconds). If the best arm survives every pruning stage, then the algorithm succeeds, as all other arms must be eliminated after ⌊log_η M⌋ iterations. Finally, using Lemma 3 and a union bound, the best arm is eliminated in one of the ⌊log_η M⌋ iterations of the algorithm with probability at most the sum over iterations of the per-iteration bound in Lemma 3.
This result completes the proof of Theorem 1.

B.1 Grid search and heuristic method
For grid search we run the sampler using the training data and calculate the RMSE/log-loss/accuracy on the test dataset.
To have a fair comparison to MAMBA (see Section B.2), we always start the sampler from the maximum a posteriori estimate (the MAP, found using optimization). As a result, we need to add noise around this MAP, or else the grid search tuning method will recommend the smallest step size available, which results in the sampler not moving away from the starting point. This happens because the MAP has the smallest RMSE/log-loss (or highest accuracy). To fix this we add Gaussian noise to the MAP, and report the scale of the noise for each model in Section C.
The heuristic method fixes the step size to be inversely proportional to the dataset size, i.e. h = 1/N (Brosse et al., 2018). For both the grid search and heuristic approaches, we use a 10% batch size throughout.

B.2 MAMBA
We investigate the trade-offs involved in estimating the KSD from samples in MAMBA. We can estimate it using the stochastic gradients computed in the SGMCMC algorithm. Alternatively, we can calculate the fullbatch gradients and use these to estimate the KSD. Although the latter option is too computationally expensive in the big-data setting, we can also thin the samples when estimating the KSD, which may make the fullbatch gradients computationally tractable.
In Figure 4 we estimate the KSD of samples from the logistic regression model over a grid of step sizes. We run SGLD on the three models for 1 second each, with a batch size of 1%. We estimate the KSD in four ways: fullbatch gradients using all the samples, fullbatch gradients using thinned samples (thinned by 5), stochastic gradients using all samples, and stochastic gradients using thinned samples. In Figure 5 we do the same but varying the batch size (and keeping the step size fixed at h = 10^{−4.5}). We can see that the KSD estimated using stochastic gradients and unthinned samples follows the fullbatch KSD well. However, as calculating the KSD for many high-dimensional samples is computationally expensive, we opt for thinned samples with fullbatch gradients in all our experiments.

C DETAILS OF EXPERIMENTS
We use the SGMCMC samplers from the package SGMCMCJax for the experiments, and we use NumPyro for the NUTS sampler (Phan et al., 2019; Bingham et al., 2018).

C.1.1 Model
Consider a binary regression model where y = {y_i}_{i=1}^N is a vector of N binary responses and X is an N × d matrix of covariates. If θ is a d-dimensional vector of model parameters, then the likelihood function for logistic regression is

p(y | X, θ) = ∏_{i=1}^N σ(x_i^⊤θ)^{y_i} (1 − σ(x_i^⊤θ))^{1−y_i},

where σ(z) = 1/(1 + e^{−z}) and x_i is the ith row of X.
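The samplers only need the gradient of the potential U(θ), which for this model has the familiar logistic-regression form. Below is a NumPy sketch of an unbiased stochastic gradient estimate (the Gaussian prior scale here is our own illustrative assumption, not necessarily the one used in the experiments):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def stochastic_grad_u(theta, X, y, n, rng, prior_scale=10.0):
    # Unbiased estimate of grad U(theta) = -grad log posterior, from a subsample
    # of size n, assuming an isotropic Gaussian prior N(0, prior_scale^2 I).
    N = X.shape[0]
    idx = rng.choice(N, size=n, replace=False)
    resid = sigmoid(X[idx] @ theta) - y[idx]   # per-observation gradient factor
    grad_lik = (N / n) * (X[idx].T @ resid)    # rescaled minibatch likelihood term
    return grad_lik + theta / prior_scale**2   # prior contribution
```

With n = N this reduces exactly to the fullbatch gradient, which is the quantity used when computing the KSD at MAMBA iterations.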
We start from the MAP with Gaussian noise (scale: σ = 0.2) and run the samplers for 5,000 iterations for each grid point.

C.1.3 MAMBA
We use the same grid of step sizes as for grid search and a grid of four batch sizes: 100%, 10%, 1%, and 0.1%. To calculate the KSD and FSSD we thin the samples by 10 and calculate fullbatch gradients. For FSSD-opt we sample 10 test locations from a Gaussian fit to the samples and optimize them using Adam.
We start from the MAP with Gaussian noise (scale: σ = 1) and run 1,000 iterations per grid point.

C.3.4 Results
We show in Table 8 the hyperparameters for the runs in Table 3.

D.1 SGHMC

From Chen et al. (2014), the SGHMC update is

v_{k+1} = (1 − α)v_k − h∇Ũ(θ_k) + √(2h(α − β)) ξ_k,
θ_{k+1} = θ_k + v_{k+1},

with v as the momentum variable, θ the parameter of interest, h the step size, and α the friction coefficient. The friction coefficient α and the noise estimation term β are tunable hyperparameters. We have set them to be α = 0.01 and β = 0 for all the experiments.

D.2 SGNHT
From Ding et al. (2014a): we augment the parameter space with a momentum variable v and a temperature (thermostat) variable α, giving the update

v_{k+1} = v_k − hα_k v_k − h∇Ũ(θ_k) + √(2ah) ξ_k,
θ_{k+1} = θ_k + h v_{k+1},
α_{k+1} = α_k + h ( (1/D) v_{k+1}^⊤ v_{k+1} − 1 ),

where D is the dimension of the parameter. The tunable parameters are h and a. We fix a = 0.01 throughout.

D.3 Control variates
SGLD with control variates (SGLD-CV) uses the update in (1), but with an alternative estimate of the gradient, ∇Û(θ), defined as follows. Let θ_MAP denote the maximum a posteriori (MAP) estimate of the posterior. The estimator for the gradient at parameter θ is given by

∇Û(θ) = ∇U(θ_MAP) + (N/n) ∑_{i∈S_n} ( ∇U_i(θ) − ∇U_i(θ_MAP) ).

See Baker et al. (2019) for more details on SGLD-CV. SGHMC-CV and SGNHT-CV are defined similarly.
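A short sketch of this control-variate estimator (our own illustrative code, assuming the per-observation gradients ∇U_i are available as a function):

```python
import numpy as np

def sgld_cv_grad(theta, theta_map, grad_i, grad_u_map, n, N, rng):
    # Control-variate gradient estimate:
    #   grad_U_hat(theta) = grad_U(theta_MAP)
    #       + (N/n) * sum_{i in S_n} [grad_U_i(theta) - grad_U_i(theta_MAP)]
    # grad_i(i, theta) returns grad U_i(theta); grad_u_map is the full-data
    # gradient at the MAP, which is precomputed once before sampling.
    idx = rng.choice(N, size=n, replace=False)
    correction = sum(grad_i(i, theta) - grad_i(i, theta_map) for i in idx)
    return grad_u_map + (N / n) * correction
```

Because θ_MAP is close to where the chain spends most of its time, the summed differences are small, so the estimator has much lower variance than the plain stochastic gradient while remaining unbiased.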

Figure 1: KSD curves for the six samplers applied to a logistic regression model.

Figure 2: KSD curves for the six samplers applied to the probabilistic matrix factorization model.

Figure 3: ECE curves for the six samplers applied to the Bayesian neural network model.

Table 1: Logistic regression. For each tuning method and each SGMCMC sampler we report the relative standard deviation error and the KSD. We abbreviate MAMBA-KSD and MAMBA-FSSD to M-KSD and M-FSSD, respectively.

Table 2: Probabilistic matrix factorization. For each tuning method and each sampler we report the KSD and the relative error of the standard deviation estimates.

Table 3: Bayesian neural network. For each tuning method and each sampler we report the ECE and MCE (as percentages).

Table 4: Logistic regression: hyperparameters for the results in Table 1. The batch size is given by τ, the percentage of the total number of data points; namely, batch size n = τN/100.

Table 5: Comparison of tuning methods for logistic regression. For each tuning method and each sampler we report the relative error of the standard deviation estimates, the KSD, the predictive accuracy, and the number of samples. Note that the number of samples generated within a fixed time budget depends on the subsample size. We try two versions of MAMBA, one with KSD as the metric and the other with FSSD-opt.

Table 6: PMF: hyperparameters for the results in Table 2. The batch size is given by τ, the percentage of the total number of data points; namely, batch size n = τN/100.

Table 7: Comparison of tuning methods for PMF. For each tuning method and each sampler we report the relative error of the standard deviation estimates, the RMSE on the test dataset, and the number of samples.

Table 8: NN: hyperparameters for the results in Table 3. The batch size is given by τ, the percentage of the total number of data points; namely, batch size n = τN/100.

Table 9: Comparison of tuning methods for the neural network model. For each tuning method and each sampler we report the ECE and MCE (as percentages), as well as the test accuracy and the number of samples.