1 Introduction

Despite their long history, linear regression models remain a key building block of many present-day statistical analyses. In the modern setting, practitioners are interested not only in making good predictions but also in investigating underlying low-dimensional structure, based on the belief that only a small subset of predictors plays a crucial role in predicting the response. These problems can be addressed by variable selection. A variable selection method is an automatic procedure that selects the best (small) subset of covariates that explains most of the variation in the response (Chipman et al. 2001). Frequentist approaches focus on model comparisons through information criteria or point estimates, using e.g. maximum penalised likelihood under sparsity assumptions (Hastie et al. 2015). Alternatively, a Bayesian approach can be taken by imposing an appropriate prior on all possible models and computing the posterior.

We consider Bayesian variable selection (BVS) with spike-and-slab priors (Mitchell and Beauchamp 1988), which lead to natural uncertainty measures such as posterior model probabilities and marginal posterior variable inclusion probabilities. Given a linear regression model with p candidate covariates, we focus on a random variable \(\gamma \in \Gamma = \{0,1\}^p\) where \(\gamma _j=1\) indicates that the j-th covariate is included in the model. The exact posterior distribution of \(\gamma \) is challenging to compute, and when \(p>30\) Markov chain Monte Carlo (MCMC) algorithms are typically used to estimate posterior summaries of interest (George and McCulloch 1993; Chipman et al. 2001). Garcia-Donato and Martinez-Beneito (2013) discuss the use of the Gibbs sampler whereas Madigan et al. (1995) (\(\text {MC}^3\)) and Brown et al. (1998) (Add-Delete-Swap) propose random-walk Metropolis-Hastings algorithms. Yang et al. (2016) show that, under some mild conditions on the posterior distribution, the Add-Delete-Swap algorithm is rapidly mixing in the sense that its mixing time grows at most polynomially in p. These approaches can, however, suffer from an unexpectedly long mixing time and therefore slow convergence when p is large. For this reason, alternative informed MCMC schemes have gained popularity for problems with discrete parameter spaces (having already achieved prominence in the continuous setting). Informed MCMC schemes are those in which the Metropolis-Hastings proposal exploits some information about the target distribution. Intuitively, the success of informed proposals relies on avoiding models with low posterior model probabilities (Zhou et al. 2021). Titsias and Yau (2017) describe the Hamming ball sampler (HBS) in which models are proposed in proportion to their locally-truncated posterior probability within a Hamming ball neighbourhood. Zanella and Roberts (2019) consider a Tempered Gibbs sampler (TGS), which involves importance sampling and more frequently updates components whose current values have low conditional probability. A more general class of locally informed and balanced proposals is introduced by Zanella (2020). These locally balanced proposals can be obtained by weighting a base kernel using a balancing function, which is a function of the posterior distribution that satisfies a certain functional property. The base kernel is typically concentrated on a neighbourhood of the current state, resulting in a proposal that is informed and balanced using “local” information about the posterior. The author shows that, under mild conditions on the target distribution, a random walk proposal is asymptotically dominated by its locally balanced counterpart in the Peskun sense as dimensionality increases (Peskun 1973; Tierney 1998). Zhou et al. (2021) present a Locally Informed and Thresholded (LIT) proposal which replaces the balancing function by a thresholded weighting function (i.e. a thresholding function). The LIT scheme is closely connected to the locally balanced proposal because the thresholding function behaves like a flexible composition of globally and locally balanced functions. This scheme has been shown to have a dimension-free mixing time bound under conditions similar to those in Yang et al. (2016). For other developments concerning locally informed proposals, see e.g. Livingstone and Zanella (2019); Gagnon (2021); Power and Goldman (2019).

Since the posterior distribution is discrete-valued, the above random-walk or informed MCMC schemes can be viewed as neighbourhood samplers. A neighbourhood sampler is an MCMC scheme which can be decomposed into two stages: (i) construct a neighbourhood, a set of states (models) around the current state (model); (ii) propose a new state (model) within the neighbourhood constructed in stage (i). For example, the \(\text {MC}^3\) and locally balanced schemes propose a new model on an identical neighbourhood consisting of models which differ from the current model in only 1 position (i.e. a Hamming neighbourhood), whereas their second stages are a random walk and an informed proposal respectively. The LIT algorithm of Zhou et al. (2021) is similar to the locally balanced scheme, but it uses the same neighbourhood generation mechanism as an Add-Delete-Swap scheme and its second stage uses a thresholding function. The design of the neighbourhoods is a crucial factor in the performance of MCMC schemes, particularly informed schemes, for two major reasons. The first is the “quality” of the neighbourhood, in the sense that we should generate neighbourhoods containing many promising models. Better quality neighbourhood construction will improve the mixing of the chain and prevent it getting stuck in low-probability models. The second is the size of the neighbourhood. Informed MCMC schemes often mix quickly and have good convergence properties, but the computation of each transition can be prohibitively expensive. For example, the neighbourhood used by the locally balanced proposal contains a number of models that is at least linear in p and, under standard sparsity assumptions, will tend to include large numbers of unimportant variables. Neighbourhoods have also been considered previously in the context of stochastic search. Hans et al. (2007) describe a novel Shotgun Stochastic Search (SSS) algorithm whilst Chen et al. (2016) consider a paired-move multiple-try stochastic search algorithm. Both schemes identify a subset of probable models and move to new models within the neighbourhood according to posterior model probabilities.

In this paper we propose a method which generates good neighbourhoods while controlling computational cost when p is large, by introducing a framework for constructing flexible and efficient MCMC algorithms based on random neighbourhoods. We refer to such a scheme as a random neighbourhood sampler and show that, when well-constructed, these schemes can lead to Markov chains with good convergence properties and controlled computational cost per iteration. Our method uses an adaptive scheme to achieve a flexible neighbourhood generating mechanism. Adaptive MCMC is a sub-class of algorithms in which tuning parameters are automatically updated “on the fly” (e.g. Andrieu and Thoms 2008). Several adaptive methods have been developed in the context of BVS (Ji and Schmidler 2013; Lamnisos et al. 2009, 2013). We build on Griffin et al. (2021) who develop the Adaptively-Scaled Individual Adaptation sampler (ASI), which is able to adapt to the importance of each candidate covariate and propose multiple swaps per iteration in high-dimensional settings. We show in this paper that the ASI algorithm is a random neighbourhood sampler whose second stage is a random-walk proposal. Based on this observation, we design a random neighbourhood informed sampler with the same neighbourhood generating mechanism as ASI, but with its second stage replaced by an informed within-neighbourhood proposal. To illustrate the power of the framework, we develop a new MCMC algorithm for Bayesian variable selection in linear regression, namely the Point-wise Adaptive Random Neighbourhood Informed (PARNI) sampler. This combines the strengths of ASI for good neighbourhood generation and locally informed proposals for avoiding random walk behaviour. An extensive set of empirical results on both real and simulated data-sets shows that the PARNI sampler yields good estimates for posterior quantities of interest and performs particularly well for well-known large-p examples such as the PCR (\(p=22,575\)) and SNP (\(p=79,748\)) data-sets.

The rest of this paper is structured as follows. In Sect. 2, we review BVS for the linear model along with prior specification. We also briefly describe both the ASI scheme of Griffin et al. (2021) and the locally informed methods of Zanella (2020) and Zhou et al. (2021). In Sect. 3, we characterise the construction of random neighbourhood proposals and illustrate that locally informed proposals and the ASI scheme fall within this framework. Section 4 presents the construction of adaptive random neighbourhood and informed samplers. Following this structure, we present the ARNI and PARNI samplers. In addition, we establish both the ergodicity and a strong law of large numbers for the PARNI algorithm. We implement the PARNI sampler in Sect. 5 on both simulated and real data. Comparisons between the PARNI samplers and other state-of-the-art MCMC algorithms are carried out to showcase their capacity and efficiency. In Sect. 6 we discuss limitations and possible future work. Detailed explanations and proofs are provided in the supplement.

2 Background

2.1 Bayesian variable selection for the linear regression model

Consider a data-set \(\{(y_i,x_{i1},...,x_{ip})\}_{i=1}^n\), where the vector \(y = (y_1,...,y_n) \in \mathbb {R}^n\) is called the response variable and each \(x_j = (x_{1j},...,x_{nj})\) is one of p predictor variables or covariates. The variable selection problem is concerned with finding the best \(q \ll p\) covariates that are most associated with the response. Assuming that each regression includes an intercept, there are \(2^p\) possible models that can be formulated to predict the response. We refer to each model as \(M_{\gamma }\), indexed by the indicator variable \(\gamma = (\gamma _1, \ldots , \gamma _p) \in \Gamma = \{0,1\}^p\), where \(\gamma _j = 1\) if the j-th variable is included in model \(M_\gamma \) and \(\gamma _j = 0\) otherwise. We refer to \(\Gamma \) as the model space and let \(p_\gamma := \sum _j \gamma _j\). The model \(M_\gamma \) associated with \(\gamma \) is then

$$\begin{aligned} y = \alpha {\textbf {1}}_n + X_\gamma \beta _\gamma + \epsilon \end{aligned}$$
(1)

where \(\epsilon \sim N_n(0, \sigma ^2 I_n)\), y is an n-dimensional response vector, \(X_\gamma \) is an \((n \times p_\gamma )\) design matrix which consists of the “active” variables in \(\gamma \) (those for which \(\gamma _j = 1\)), \(\alpha \) is an intercept term and \(\beta _\gamma \in \mathbb {R}^{p_\gamma }\). In the Bayesian framework, we consider a commonly-used conjugate prior specification

$$\begin{aligned}&p(\alpha ) \propto 1, \quad \beta _\gamma |\gamma , \sigma ^2 \sim N(0, g \sigma ^2 V_\gamma ), \quad p(\sigma ^2) \propto \sigma ^{-2}, \quad \\&\quad p(\gamma ) = h^{p_{\gamma }} (1-h)^{p-p_{\gamma }}. \end{aligned}$$

For simplicity, we can remove the intercept term \(\alpha \) by centering y and \(X_j\) for all j. Chipman et al. (2001) highlight that this method can be motivated from a formal Bayesian perspective by integrating out the coefficients corresponding to those fixed regressors with respect to an improper uniform prior. The covariance matrix \(V_\gamma \) is often chosen as \((X_\gamma ^T X_\gamma )^{-1}\) (a g-prior) or the identity matrix \(I_{p_\gamma }\) (an independence prior). In what follows, we will focus on the independence prior where \(V_\gamma = I_{p_\gamma }\). For both of these choices, the marginal likelihood \(p(y|\gamma )\) is analytically tractable. Suitable values for the global scale parameter g are suggested in Fernandez et al. (2001). Alternatively, g can be given a hyperprior, yielding a fully Bayesian model (see Liang et al. (2008) for details). The hyperparameter \(h \in (0,1)\) is the prior probability that each variable is included in the model. Steel and Ley (2007) advise against using a fixed h unless strong prior information is available, and instead suggest placing a hyperprior on it, such as a Beta prior \(h \sim \text {Beta}(a,b)\), leading to a Beta-binomial prior on the model size. The choices of g and h will be specified later for each data-set. In the following sections, we will develop efficient sampling schemes targeting the posterior distribution \(\pi (\gamma ) \propto p(y|\gamma )p(\gamma )\).
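To make the marginal likelihood concrete, the sketch below (our own illustrative Python code, with hypothetical function names) evaluates \(\log p(y|\gamma )\) under the independence prior, up to an additive constant not depending on \(\gamma \), using the standard closed form obtained after centering y and the columns of X and integrating out \(\beta _\gamma \) and \(\sigma ^2\); this is the quantity that the samplers discussed below evaluate repeatedly.

```python
import numpy as np

def log_marginal_likelihood(gamma, y, X, g):
    """log p(y | gamma) under the independence prior V_gamma = I, up to an
    additive constant not depending on gamma. Assumes y and the columns of
    X have already been centred."""
    n = len(y)
    X_g = X[:, gamma.astype(bool)]            # design matrix of "active" variables
    p_g = X_g.shape[1]
    yty = y @ y
    if p_g == 0:
        return -0.5 * (n - 1) * np.log(yty)
    A = X_g.T @ X_g + np.eye(p_g) / g         # X_g' X_g + g^{-1} I
    L = np.linalg.cholesky(A)
    b = np.linalg.solve(L, X_g.T @ y)
    rss = yty - b @ b                         # y'y - y' X_g A^{-1} X_g' y
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return (-0.5 * p_g * np.log(g) - 0.5 * logdet
            - 0.5 * (n - 1) * np.log(rss))

def log_posterior(gamma, y, X, g, h):
    """log pi(gamma) up to a constant, combining the marginal likelihood
    with the Bernoulli model prior p(gamma) = h^{p_gamma} (1-h)^{p-p_gamma}."""
    p_g = gamma.sum()
    return (log_marginal_likelihood(gamma, y, X, g)
            + p_g * np.log(h) + (len(gamma) - p_g) * np.log(1 - h))
```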

Remark 1

For a linear regression model with p candidate covariates, it has been shown that spike-and-slab priors often lead to posterior consistency in the sense that the posterior collapses to a Dirac measure on the true model as more observations are gathered (Fernandez et al. 2001; Liang et al. 2008; Yang et al. 2016), even in high-dimensional settings where p grows with n (Shang and Clayton 2011; Narisetty and He 2014). Another approach is to employ continuous shrinkage priors (e.g. Polson and Scott 2010; Griffin and Brown 2021), which only provide posterior inference on the regression coefficients but can result in a more computationally tractable posterior distribution.

2.2 Adaptively scaled individual adaptation algorithm

Griffin et al. (2021) introduce a scalable adaptive MCMC algorithm targeting high-dimensional BVS posterior distributions together with a method that automatically updates the tuning parameters. They consider the class of proposal kernels

$$\begin{aligned} q_{\eta } (\gamma , \gamma ^\prime ) = \prod _{j=1}^p q_{\eta , j}(\gamma _j, \gamma _j^\prime ) \end{aligned}$$
(2)

where \(\eta = (A, D) = (A_1, \ldots , A_p, D_1, \ldots , D_p)\), \(q_{\eta , j}(\gamma _j=0, \gamma _j^\prime =1) = A_j\) and \(q_{\eta , j}(\gamma _j = 1, \gamma _j^\prime = 0) = D_j\), with Metropolis-Hastings acceptance probability

$$\begin{aligned} \alpha _\eta (\gamma , \gamma ^\prime ) = \min \left\{ 1, \frac{\pi (\gamma ^\prime )q_{\eta }(\gamma ^\prime , \gamma )}{\pi (\gamma )q_{\eta }(\gamma , \gamma ^\prime )}\right\} . \end{aligned}$$
(3)

This proposal mainly benefits from two aspects. Firstly, the flexibility offered by 2p tuning parameters allows the proposal to be tailored to the data. Secondly, this form of proposal also allows multiple variables to be added or deleted from the model in a single iteration, which in turn allows the algorithm to make large jumps in model space.
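As a minimal sketch (our own code, not the authors' implementation), one draw from the proposal (2) and the log proposal ratio appearing in (3) can be computed as follows; here `A` and `D` hold the per-variable addition and deletion probabilities (for instance the scaled values \(\zeta A^{\text {opt}}_j\) and \(\zeta D^{\text {opt}}_j\) introduced below).

```python
import numpy as np

def asi_propose(gamma, A, D, rng):
    """One draw from the product proposal (2): position j is flipped to 1
    with probability A_j when gamma_j = 0, and to 0 with probability D_j
    when gamma_j = 1."""
    flip_prob = np.where(gamma == 0, A, D)
    flips = rng.random(len(gamma)) < flip_prob
    return np.where(flips, 1 - gamma, gamma)

def asi_log_proposal_ratio(gamma, gamma_new, A, D):
    """log q(gamma', gamma) - log q(gamma, gamma'), as needed in (3)."""
    changed = gamma != gamma_new
    fwd = np.where(gamma == 0, A, D)          # flip probabilities from gamma
    rev = np.where(gamma_new == 0, A, D)      # flip probabilities from gamma'
    log_fwd = np.sum(np.where(changed, np.log(fwd), np.log(1 - fwd)))
    log_rev = np.sum(np.where(changed, np.log(rev), np.log(1 - rev)))
    return log_rev - log_fwd
```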

Algorithm 1 The Adaptively Scaled Individual Adaptation (ASI) sampler

Griffin et al. (2021) suggest an optimal choice of \(\eta = (A, D)\) in the Peskun sense under the assumption that all variables are independent. If \(\pi _j\) denotes the posterior inclusion probability of the j-th regressor, the optimal choice \(\eta ^{\text {opt}} = (A^{\text {opt}}, D^{\text {opt}})\) is given by

$$\begin{aligned} A^{\text {opt}}_j = \min \left\{ 1, \frac{\pi _j}{1-\pi _j}\right\} , \quad D^{\text {opt}}_j = \min \left\{ 1, \frac{1-\pi _j}{\pi _j}\right\} . \end{aligned}$$
(4)

The independence assumption is usually violated due to correlation between regressors, and therefore a scaled proposal with parameters \(\eta = \zeta \eta ^{\text {opt}}\) for a scaling parameter \(\zeta \in (0,1)\) is suggested. This scaling parameter \(\zeta \) controls the number of variables that differ between the current state \(\gamma \) and the proposed state \(\gamma '\). Smaller values of \(\zeta \) can be used to avoid overly ambitious moves with low probabilities of acceptance and so control the average acceptance rate. They also suggest multiple chain acceleration with common adaptive parameters, since running multiple independent chains with shared adaptive parameters can facilitate the convergence of the adaptive parameters (Craiu et al. 2009). This phenomenon is demonstrated in their simulation studies, where schemes with 25 parallel chains outperform schemes with only 5 chains in terms of relative efficiency, especially for large-p data-sets. Suppose L chains are used and let \(\gamma ^{l,(i)}\) and \(\gamma ^{l,\prime }\) denote the current state and proposal for the l-th chain respectively. We also define the vector \(\gamma _{-j} = (\gamma _1, \ldots ,\gamma _{j-1}, \gamma _{j+1},\ldots ,\gamma _{p})\), that is \(\gamma \) with \(\gamma _j\) removed. The tuning parameters of the proposal are updated on the fly using a Rao-Blackwellised estimate of the posterior inclusion probability of the j-th regressor which, at the N-th iteration, is

$$\begin{aligned} {\hat{\pi }}^{(N)}_j = \frac{1}{NL} \sum _{i=1}^N \sum _{l=1}^L \frac{\pi (\gamma _j = 1, \gamma ^{l,(i)}_{-j}|y)}{\pi (\gamma _j = 1, \gamma ^{l,(i)}_{-j}|y) + \pi (\gamma _j = 0, \gamma ^{l,(i)}_{-j}|y)}. \end{aligned}$$
(5)
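A direct (non-optimised) illustration of (5) is sketched below: it simply evaluates the log posterior with \(\gamma _j\) set to 1 and to 0 for each j, so it costs p full posterior evaluations per state rather than the \(\mathcal {O}(p)\) scheme of Griffin et al. (2021) described next; `log_pi` is a hypothetical callable returning \(\log \pi (\gamma )\) up to a constant (for instance the `log_posterior` sketch from Sect. 2.1).

```python
import numpy as np

def conditional_pips(gamma, log_pi):
    """pi(gamma_j = 1 | gamma_{-j}, y) for every j; averaging these values
    over iterations and chains gives the Rao-Blackwellised estimate (5)."""
    p = len(gamma)
    pips = np.empty(p)
    for j in range(p):
        g1, g0 = gamma.copy(), gamma.copy()
        g1[j], g0[j] = 1, 0
        pips[j] = 1.0 / (1.0 + np.exp(log_pi(g0) - log_pi(g1)))
    return pips

# running average over iterations and chains at iteration N:
# pi_hat = ((N - 1) * pi_hat + np.mean([conditional_pips(gam, log_pi)
#                                       for gam in chain_states], axis=0)) / N
```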

Rao-Blackwellised estimates of the posterior inclusion probabilities can swiftly distinguish unimportant variables. Griffin et al. (2021) show how these Rao-Blackwellised estimates can be calculated in \(\mathcal {O}(p)\) operations, which leads to a scalable MCMC scheme in large-p BVS problems. At the i-th iteration, the proposal parameters are \(\eta = \zeta ^{(i)} \times \eta ^{(i)}\) where \(\eta ^{(i)} = (A^{(i)}, D^{(i)})\),

$$\begin{aligned} A^{(i)}_j = \min \left\{ 1, \frac{{\hat{\pi }}^{(i)}_j}{1-{\hat{\pi }}^{(i)}_j}\right\} , \quad D^{(i)}_j = \min \left\{ 1, \frac{1-{\hat{\pi }}^{(i)}_j}{{\hat{\pi }}^{(i)}_j}\right\} \end{aligned}$$
(6)

and the scaling parameter \(\zeta ^{(i)}\) is tuned using the Robbins-Monro scheme

$$\begin{aligned} \text {logit}_\epsilon \zeta ^{(i+1)} = \text {logit}_\epsilon \zeta ^{(i)} + \frac{\phi _i}{L} \sum _{l=1}^L (\alpha _{\zeta ^{(i)} \eta ^{(i)}}(\gamma ^{l,(i)},\gamma ^{l,\prime }) - \tau ) \end{aligned}$$
(7)

for a target rate of acceptance \(\tau \), where the mapping \(\text {logit}_\epsilon :(\epsilon , 1-\epsilon ) \rightarrow \mathbb {R}\) is a modified logit function defined by

$$\begin{aligned} \text {logit}_\epsilon (x) = \log (x-\epsilon ) - \log (1-x-\epsilon ) \end{aligned}$$
(8)

for some small \(\epsilon \in (0, 1/2)\). The full description of the sampler is given in Algorithm 1. The resulting algorithm is called Adaptively Scaled Individual Adaptation (ASI). Griffin et al. (2021) establish the \(\pi \)-ergodicity and a strong law of large numbers for the ASI sampler.
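The ε-modified logit map (8) and the Robbins-Monro update (7) are straightforward to implement; the sketch below (our own code) updates ζ from the acceptance probabilities of the L chains at iteration i.

```python
import numpy as np

def logit_eps(x, eps):
    """Modified logit map (eps, 1 - eps) -> R, as in (8)."""
    return np.log(x - eps) - np.log(1.0 - x - eps)

def inv_logit_eps(z, eps):
    """Inverse of logit_eps, mapping R back to (eps, 1 - eps)."""
    return eps + (1.0 - 2.0 * eps) / (1.0 + np.exp(-z))

def update_zeta(zeta, acc_probs, i, eps, tau=0.234, lam=0.7):
    """One Robbins-Monro step (7); acc_probs holds the L acceptance
    probabilities at iteration i (1-indexed)."""
    phi_i = i ** (-lam)
    z = logit_eps(zeta, eps) + phi_i * (np.mean(acc_probs) - tau)
    return inv_logit_eps(z, eps)
```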

Remark 2

The performance of the ASI algorithm is crucially related to the choice of appropriate values of parameters and hyperparameters. The parameters \(\eta ^{(i)}\) and \(\zeta ^{(i)}\) are updated on the fly. The hyperparameters are chosen as follows: \(\phi _i = i^{-0.7}\), \(\tau = 0.234\), \(\epsilon = 0.1/p\) and \(\pi _0 = 0.001\). This hyperparameter specification is suggested by Griffin et al. (2021), who show that it works well in general based on empirical performance. See their paper for further discussion of the choice of hyperparameters.

2.3 Locally informed proposals for discrete-valued variables

On continuous sample spaces, MCMC algorithms often utilise gradients of the target distribution, e.g. the Metropolis-adjusted Langevin algorithm (Grenander and Miller 1994) and Hamiltonian Monte Carlo (Duane et al. 1987). These methods are defined on continuous spaces, but Zanella (2020) develops a class of informed proposals as an analogue for discrete spaces. The approach assumes that we can define a random walk Metropolis proposal kernel Q on a neighbourhood \(N \subset \Gamma \) with mass function q. In this paper, we consider the following construction of informed proposals described by Zanella (2020):

$$\begin{aligned} q_g(\gamma , \gamma ^\prime ) = {\left\{ \begin{array}{ll} \frac{g\left( \frac{\pi (\gamma ^\prime )}{\pi (\gamma )}\right) q(\gamma , \gamma ^\prime )}{Z_g(\gamma )}, \quad &{} \gamma ^\prime \in N \\ 0, \quad &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(9)

where \(g:[0,\infty )\rightarrow [0,\infty )\) is a monotone continuous weighting function and \(Z_g(\gamma )\) is a normalising constant such that

$$\begin{aligned} Z_g(\gamma ) = \sum _{\gamma ^\prime \in N} g\left( \frac{\pi (\gamma ^\prime )}{\pi (\gamma )}\right) q(\gamma , \gamma ^\prime ). \end{aligned}$$
(10)

The choice of the weighting function g is crucial for the performance of \(Q_g\) since it determines how the target distribution \(\pi \) drives the proposal. When g is the constant function \(g(t) = 1\), the resulting informed proposal \(Q_g\) coincides with the base kernel Q and is referred to as a non-informed proposal. Zanella (2020) mainly discusses locally balanced proposals, which are formed using balancing functions that satisfy \(g(t) = tg(1/t)\) for all \(t > 0\). Locally balanced proposals are approximately \(\pi \)-reversible if Q is restricted to local moves. The neighbourhoods are normally chosen to be \(N = \mathcal {H}_m(\gamma ) := \{\gamma ^\prime \in \Gamma | d_H(\gamma ^\prime , \gamma ) \le m\}\), where \(d_H(\cdot ,\cdot )\) denotes the Hamming distance (i.e. \(d_H(\gamma , \gamma ^\prime ) = \sum _{j=1}^p |\gamma _j - \gamma _j^\prime |\)), and the base kernel Q is a uniform distribution on the neighbourhood N. When m is taken to be 1, the base kernel Q is identical to the \(\text {MC}^3\) proposal. In addition, taking g as the identity function (i.e. \(g(t) = t\)) leads to a globally balanced proposal \(Q_g\), which is \(\pi \)-reversible when the neighbourhood N is the whole sample space.
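For concreteness, the sketch below (our own code) forms the locally balanced proposal (9) on the 1-Hamming neighbourhood with a uniform base kernel: each single-flip neighbour receives weight \(g(\pi (\gamma ^\prime )/\pi (\gamma ))\), with the posterior ratio computed from log-posterior differences; `log_pi` is an assumed callable returning \(\log \pi (\gamma )\) up to a constant.

```python
import numpy as np

def locally_balanced_weights(gamma, log_pi, g=np.sqrt):
    """Proposal probabilities (9) over the 1-Hamming neighbourhood of gamma,
    with a uniform base kernel q and balancing function g (default sqrt)."""
    p = len(gamma)
    lp_current = log_pi(gamma)
    weights = np.empty(p)
    for j in range(p):
        neighbour = gamma.copy()
        neighbour[j] = 1 - neighbour[j]
        ratio = np.exp(log_pi(neighbour) - lp_current)
        weights[j] = g(ratio) / p          # g(pi(gamma')/pi(gamma)) * q(gamma, gamma')
    return weights / weights.sum()         # divide by Z_g(gamma)

# a flip position is then drawn with these probabilities, e.g.
# j = rng.choice(p, p=locally_balanced_weights(gamma, log_pi))
```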

Theorem 5 of Zanella (2020) shows that using a uniform base kernel on the neighbourhood \(\mathcal {H}_m(\gamma )\) combined with a balancing function g as described above is asymptotically optimal relative to the un-informed or globally balanced proposals, in terms of Peskun ordering, as the dimensionality goes to infinity, under the condition that \(\sup _{\gamma \in \Gamma ,\gamma ^\prime \in N} Z_g(\gamma )/Z_g(\gamma ^\prime ) \rightarrow 1\). However, for a Bayesian variable selection problem, Zhou et al. (2021) argue that the behaviour of the function \(\gamma \mapsto Z_g(\gamma )\) is difficult to predict and the assumption may not hold. They therefore suggest a modified weighting function with upper and lower bounds

$$\begin{aligned} g(t) = \min \{\max \{p^l, t\}, p^L\} \end{aligned}$$
(11)

where p is the total number of regressors and \(-\infty<l< L < \infty \) are some constants. In what follows, the weighting function in (11) is referred to as the thresholding function. The thresholding function is flexible in the sense that it includes globally and locally balanced functions for specific values of l and L.

Their Locally Informed and Thresholded (LIT) algorithm works on neighbourhoods derived from the Add-Delete-Swap scheme and allows the values of l and L to change with the type of move. Under the conditions that the posterior mass concentrates on a small set of models and the chain starts at a model that is not too far from the true data-generating model, they prove that the LIT algorithm achieves a dimension-free mixing rate provided its parameters are properly selected.

3 Random neighbourhood samplers and the ASI algorithm

Let us recall the idea of a neighbourhood sampler from Sect. 1. In general, the neighbourhoods can be random and tailored to the target distribution \(\pi \); we refer to such a scheme as a random neighbourhood sampler. In this section, we present random neighbourhood samplers in detail and show, via Theorem 1, that the ASI sampler is a random neighbourhood sampler.

3.1 Random neighbourhood samplers

We consider a framework for constructing Metropolis-Hastings proposals to sample from \(\pi (\gamma )\) in which a new state is proposed within a random neighbourhood around the current state. The random neighbourhoods are generated using an auxiliary variable k as a neighbourhood indicator. This auxiliary variable k is a discrete random variable defined on a countable set \(\mathcal {K}\) such that the probability of generating a neighbourhood \(N = N(\gamma , k)\) is the same as the probability of generating k (i.e. \(p(N|\gamma ) = p(k|\gamma )\)). Suppose \(\gamma \) is the current state and \(Q_k\) is a Metropolis-Hastings proposal kernel (conditioned on k) with mass function \(q_k\). A new state \(\gamma ^\prime \) is drawn from the kernel \(Q_k\) after a value of k has been generated. In updating k at each iteration, we usually consider proposing a new state \(k^\prime \) conditional on the current state k through a deterministic bijection \(\rho :\mathcal {K} \rightarrow \mathcal {K}\) such that \(k^\prime = \rho (k)\). The mapping \(\rho \) should be an involution, that is a self-inverse function satisfying \(\rho (\rho (k)) = k\). We call an MCMC algorithm that uses the above construction to generate Metropolis-Hastings proposals a random neighbourhood sampler. The following are some examples of random neighbourhood samplers.

Example 1

(Samplers with non-stochastic neighbourhoods)

Samplers with non-stochastic neighbourhoods are also random neighbourhood samplers, in which a specific neighbourhood is generated with probability 1 at each state \(\gamma \). In such cases, the choices of k and \(\rho \) can be arbitrary. For instance, the \(\text {MC}^3\) sampler can be viewed as a random neighbourhood sampler for which the neighbourhood N consists of models that are Hamming distance 1 from \(\gamma \). In particular, the locally balanced samplers of Zanella (2020) also belong to this class, with the neighbourhood N defined as in Sect. 2.3.

Example 2

(Add-Delete-Swap sampler and LIT proposal)

In each iteration of an Add-Delete-Swap (ADS) sampler, a strategy from “addition”, “deletion” and “swap” is chosen uniformly, which implies that the auxiliary variable k is uniformly distributed over the sample space \(\mathcal {K} = \{\text {``addition''}, \text {``deletion''}, \text {``swap''}\}\); a neighbourhood \(N(\gamma , k)\) is then constructed as in Yang et al. (2016). A new state \(\gamma ^\prime \) is uniformly proposed from \(N(\gamma , k)\). The corresponding mapping \(\rho \) sends the auxiliary variable to the opposite strategy, e.g. it sends “addition” to “deletion” and vice versa; the opposite of “swap” is “swap” itself. The Locally Informed and Thresholded (LIT) proposal of Zhou et al. (2021) has an identical neighbourhood construction to an ADS sampler but proposes a new model using an informed proposal with weighting functions bounded above and below.

Example 3

(Hamming ball sampler)

A Hamming ball sampler with radius m is described by Titsias and Yau (2017). The algorithm draws an auxiliary variable (U in their notation, k here) uniformly over the Hamming ball \(\mathcal {H}_m(\gamma ) \subset \Gamma \), the set of states at most Hamming distance m away from \(\gamma \), and then uses the neighbourhood \(N(\gamma , k) = \mathcal {H}_m(k)\) to draw a new state. The Hamming ball sampler proposes a new state according to the posterior model probability truncated to the neighbourhood \(N(\gamma , k)\). In this scheme, the mapping \(\rho \) is the identity function, meaning the same auxiliary variable is used in the reverse move.

The full update of a random neighbourhood sampler uses the three stages below (a schematic code sketch follows the list):

(i) (Neighbourhood construction) Sample a neighbourhood indicator k from \(p(\cdot |\gamma )\), and construct the corresponding neighbourhood \(N(\gamma , k)\);

(ii) (Within-neighbourhood proposal) Propose a new model \(\gamma ^\prime \) in \(N(\gamma , k)\) according to \(Q_k(\gamma , \cdot )\);

(iii) (Accept/reject step) Calculate the probability of the reverse move, \(q_{\rho (k)}(\gamma ^\prime , \gamma )\), by constructing the reverse neighbourhood \(N(\gamma ^\prime , \rho (k))\). Move to the new state \(\gamma ^\prime \) with probability \(\alpha _k(\gamma , \gamma ^\prime )\), where \(\alpha _k(\gamma , \gamma ^\prime )\) is the Metropolis-Hastings acceptance probability

$$\begin{aligned} \alpha _k(\gamma , \gamma ^\prime ) = \min \left\{ 1, \frac{\pi (\gamma ^\prime )p(\rho (k)|\gamma ^\prime )q_{\rho (k)}(\gamma ^\prime , \gamma )}{\pi (\gamma )p(k|\gamma )q_k(\gamma , \gamma ^\prime )}\right\} . \end{aligned}$$
(12)
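Schematically, one full update of a random neighbourhood sampler can be written as the following sketch (our own pseudocode-style Python, in which all of the function arguments are hypothetical stand-ins for the quantities defined above):

```python
import numpy as np

def random_neighbourhood_update(gamma, log_pi, sample_k, log_p_k,
                                propose_within, log_q_k, rho, rng):
    """One transition of a generic random neighbourhood sampler."""
    # (i) neighbourhood construction: draw the indicator k ~ p(. | gamma)
    k = sample_k(gamma, rng)
    # (ii) within-neighbourhood proposal: draw gamma' ~ Q_k(gamma, .)
    gamma_new = propose_within(gamma, k, rng)
    # (iii) accept/reject with probability (12), using the reverse indicator rho(k)
    log_alpha = (log_pi(gamma_new) + log_p_k(rho(k), gamma_new)
                 + log_q_k(rho(k), gamma_new, gamma)
                 - log_pi(gamma) - log_p_k(k, gamma)
                 - log_q_k(k, gamma, gamma_new))
    if np.log(rng.random()) < min(0.0, log_alpha):
        return gamma_new
    return gamma
```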

Throughout this article, we refer to the above three stages as neighbourhood construction, within-neighbourhood proposal and accept/reject step respectively. To preserve the reversibility of the chain, it is desirable to design a neighbourhood generation scheme in which the equivalence

$$\begin{aligned} \gamma ^\prime \in N(\gamma , k) \iff \gamma \in N(\gamma ^\prime , \rho (k)) \end{aligned}$$
(13)

holds for any \(\gamma \), \(\gamma ^\prime \) and k. Given this equivalence, we assume that the condition

$$\begin{aligned} p(k|\gamma )q_k(\gamma , \gamma ^\prime )> 0 \iff p(\rho (k)|\gamma )q_{\rho (k)}(\gamma ^\prime , \gamma ) > 0 \end{aligned}$$
(14)

is satisfied. This assumption is a generalisation of the paired-move strategy in Chen et al. (2016) and it ensures the correctness and reversibility of such a scheme, as shown by the following proposition.

Proposition 1

A random neighbourhood sampler is \(\pi \)-reversible provided that condition (14) holds, \(p(k|\gamma )\) is a valid probability measure on \(\mathcal {K}\) and \(q_k(\gamma , \gamma ^\prime )\) is a valid probability measure on neighbourhood \(N(\gamma , k)\) for all \(\gamma \in \Gamma \) and \(k \in \mathcal {K}\).

Remark 3

To generalise the framework of random neighbourhood samplers, it is possible to use a continuous auxiliary variable k. In such a case, the acceptance probability in (12) should include the Jacobian term.

We show in the next subsection that the ASI sampler is also a random neighbourhood sampler. Unlike the locally balanced proposals, it focuses on constructing sophisticated random neighbourhoods which are more likely to contain promising models, and employs a random walk within-neighbourhood proposal.

3.2 Another take on the ASI scheme

It is not straightforward to see that the ASI sampler is a random neighbourhood sampler; however, we show below that it can indeed be viewed as one. To do so, we introduce a random neighbourhood sampler, the Adaptive Random Neighbourhood (ARN) sampler, and prove that the ARN and ASI samplers are equivalent if they share some common adaptive parameters. The ARN sampler uses a random walk within the neighbourhood but, compared to the locally informed approach, puts more effort into neighbourhood construction.

We consider a random neighbourhood sampler with algorithmic tuning parameter \(\theta = (\xi \eta ^{\text {opt}}, \omega ) \in (\epsilon , 1-\epsilon )^{2p+1} := \Delta _\epsilon ^{2p+1}\) and a small \(\epsilon \in (0, 1/2)\), where \(\eta ^{\text {opt}}\) is given in (4), and the tuning parameters \(\xi \) and \(\omega \) are used in the random neighbourhood construction and the within-neighbourhood proposal respectively. In the random neighbourhood construction, the neighbourhood indicator variable \(k = (k_1, \ldots , k_p) \in \mathcal {K} = \{0,1\}^p\) is generated from the distribution

$$\begin{aligned} p^{\text {RN}}_{\xi \eta ^{\text {opt}}}(k| \gamma ) =&\prod _{j = 1}^p p^{\text {RN}}_{\xi \eta ^{\text {opt}}, j}(k_j| \gamma _j) \end{aligned}$$
(15)

where \(p^{\text {RN}}_{\xi \eta ^{\text {opt}},j}(k_j=1|\gamma _j=0) = \xi A_j^{\text {opt}}\) and \(p^{\text {RN}}_{\xi \eta ^{\text {opt}},j}(k_j=1|\gamma _j=1) = \xi D_j^{\text {opt}}\). This is equivalent to the ASI proposal in (2) with \(k_j=1\) if and only if \(\gamma _j \ne \gamma _j^\prime \). A neighbourhood \(N(\gamma , k)\) is obtained from \(\gamma \) and k, where \(\gamma \) is the “centre” of \(N(\gamma , k)\) and k indicates the positions that may be altered from \(\gamma \). The tuning parameters \(\xi \) and \(\eta ^{\text {opt}}\) are adaptively updated on the fly. For any \(\gamma ^* \in N(\gamma , k)\), \(k_j = 0\) implies that \(\gamma ^*_j = \gamma _j\). This identity can be used to state a formal definition of the neighbourhood \(N(\gamma , k)\) as

$$\begin{aligned} N(\gamma , k) = \{ \gamma ^* \in \Gamma | \gamma _j = \gamma _j^*, ~ \forall k_j = 0 \}. \end{aligned}$$

The neighbourhood contains \(2^{p_k}\) models, where \(p_k\) is the number of 1s in k (i.e. \(p_k := \sum _{j=1}^p k_j\)). The parameter \(\xi \) affects \(p_k\) and therefore controls the neighbourhood size, so we call \(\xi \) the neighbourhood scaling parameter.
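As a small illustration (our own code), the neighbourhood \(N(\gamma , k)\) can be enumerated by letting the flagged positions take all \(2^{p_k}\) combinations of values:

```python
import itertools
import numpy as np

def neighbourhood(gamma, k):
    """Enumerate N(gamma, k): all models that agree with gamma wherever k_j = 0."""
    free = np.flatnonzero(k)                    # positions allowed to change
    models = []
    for bits in itertools.product([0, 1], repeat=len(free)):
        cand = gamma.copy()
        cand[free] = bits
        models.append(cand)
    return models                               # contains 2^{p_k} models
```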

The mapping \(\rho \) is chosen to be the identity function. The within-neighbourhood proposal in this adaptive random neighbourhood scheme is also based on the proposal in (2), restricted to the neighbourhood \(N(\gamma , k)\). It can be characterised as choosing the variables to be added to or deleted from the model by thinning the set \(\{j | ~ k_j = 1 \}\), with thinning probability \(\omega \in (0,1)\). We refer to \(\omega \) as the proposal thinning parameter; it is the only tuning parameter of the within-neighbourhood proposal. A larger value of \(\omega \) increases the probability of proposing \(\gamma ^\prime \) further away from \(\gamma \) in Hamming distance. This can be written formally as the proposal in (2) with tuning parameter \(\eta ^{\text {THIN}} = (A^{\text {THIN}}, D^{\text {THIN}}) = (\omega k,\omega k)\), that is \(A^{\text {THIN}}_j = D^{\text {THIN}}_j = \omega \) for \(k_j = 1\) and \(A^{\text {THIN}}_j = D^{\text {THIN}}_j = 0\) otherwise. The resulting proposal is denoted \(q^{\text {THIN}}_{\omega , k}\) and is given by

$$\begin{aligned} q_{\omega , k}^{\text {THIN}}(\gamma , \gamma ^\prime ) = \prod _{j=1}^p q_{\omega , k_j}^{\text {THIN}}(\gamma _j, \gamma ^\prime _j), \end{aligned}$$
(16)

where \(q_{\omega , 1}^{\text {THIN}}(\gamma _j, 1-\gamma _j) = \omega \) and \(q_{\omega , 0}^{\text {THIN}}(\gamma _j, 1-\gamma _j) = 0\). The proposal \(q_{\omega , k}^{\text {THIN}}\) is symmetric and only generates new states inside the neighbourhood \(N(\gamma , k)\), because the probability of flipping any coordinate j with \(k_j = 0\) is zero. The scheme is completed by accepting or rejecting the proposal using a standard Metropolis-Hastings acceptance probability

$$\begin{aligned} \alpha _{\theta , k}^{\text {ARN}} (\gamma , \gamma ^\prime ) = \min \left\{ 1, \frac{\pi (\gamma ^\prime ) p^{\text {RN}}_{\xi \eta ^{\text {opt}}}(k|\gamma ^\prime ) q^{\text {THIN}}_{\omega , k}(\gamma ^\prime , \gamma ) }{\pi (\gamma ) p^{\text {RN}}_{\xi \eta ^{\text {opt}}}(k|\gamma ) q^{\text {THIN}}_{\omega , k}(\gamma , \gamma ^\prime ) } \right\} . \end{aligned}$$
(17)
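A minimal sketch of one ARN proposal is given below (our own code): the indicator k is drawn as in (15), the within-neighbourhood move thins the flagged positions with probability ω as in (16), and, because \(q^{\text {THIN}}_{\omega , k}\) is symmetric, the acceptance ratio (17) only involves the posterior and \(p^{\text {RN}}\) terms.

```python
import numpy as np

def arn_propose(gamma, A_opt, D_opt, xi, omega, rng):
    """Draw k from (15) and then gamma' from the thinned proposal (16)."""
    prob_k = xi * np.where(gamma == 0, A_opt, D_opt)   # P(k_j = 1 | gamma_j)
    k = (rng.random(len(gamma)) < prob_k).astype(int)
    flips = (rng.random(len(gamma)) < omega) & (k == 1)
    gamma_new = np.where(flips, 1 - gamma, gamma)
    return gamma_new, k

def log_p_rn(k, gamma, A_opt, D_opt, xi):
    """log p^RN(k | gamma) from (15)."""
    prob_k = xi * np.where(gamma == 0, A_opt, D_opt)
    return np.sum(np.where(k == 1, np.log(prob_k), np.log(1.0 - prob_k)))

# acceptance probability (17), with the symmetric q^THIN terms cancelling:
# log_alpha = (log_pi(gamma_new) + log_p_rn(k, gamma_new, A_opt, D_opt, xi)
#              - log_pi(gamma)   - log_p_rn(k, gamma,     A_opt, D_opt, xi))
```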

Remark 4

An alternative formulation to (16) in terms of Hamming distance between \(\gamma \) and \(\gamma ^\prime \) is

$$\begin{aligned} q_{\omega , k}^{\text {THIN}}(\gamma , \gamma ^\prime )&= \omega ^{d_H(\gamma , \gamma ^\prime )} (1-\omega )^{p_k - d_H(\gamma , \gamma ^\prime )} \mathbb {I}\{ \gamma ^\prime \in N(\gamma , k)\} \nonumber \\&= \left( \frac{\omega }{1-\omega }\right) ^{d_H(\gamma , \gamma ^\prime )} (1-\omega )^{p_k} \mathbb {I}\{ \gamma ^\prime \in N(\gamma , k)\} \end{aligned}$$
(18)

where \(d_H(\gamma , \gamma ^\prime )\) is the Hamming distance between the two models \(\gamma \) and \(\gamma ^\prime \).

Remark 5

When \(\omega \) is chosen to be 1/2, the within-neighbourhood proposal \(q^{\text {THIN}}_{\omega = 1/2, k}\) is uniformly distributed over the local neighbourhood \(N(\gamma , k)\).

Algorithm 2 The Adaptive Random Neighbourhood (ARN) proposal

Algorithm 2 describes how a new state \(\gamma ^\prime \) is proposed using the ARN scheme. We indicate the transition kernel by \(p^{\text {ARN}}_{\theta }\) and the corresponding sub-transition kernel conditional on k by \(p^{\text {ARN}}_{\theta , k}\). They obey the relationship

$$\begin{aligned} p^{\text {ARN}}_{\theta }(\gamma , \gamma ^\prime ) = \sum _{k \in \mathcal {K}} p^{\text {ARN}}_{\theta , k}(\gamma , \gamma ^\prime ). \end{aligned}$$

The following proposition helps to show that the ARN sampler is \(\pi \)-reversible.

Proposition 2

For any tuning parameter \(\theta = (\eta , \omega ) \in \Delta _\epsilon ^{2p+1} = (\epsilon , 1-\epsilon )^{2p+1}\), condition (14) holds and the conditional distribution of k within the ARN sampler, \(p^{\text {RN}}_\eta (k|\gamma )\), is a valid probability distribution on \(\mathcal {K} = \{0,1\}^p\). In addition, for any \(\gamma \in \Gamma \) and \(k \in \mathcal {K}\), the within-neighbourhood proposal of the ARN sampler, \(q^{\text {THIN}}_{\omega , k}(\gamma ,\gamma ^\prime )\), is also a valid probability distribution on \(N(\gamma ,k)\).

Proposition 1 together with Proposition 2 show that the ARN transition kernel is \(\pi \)-reversible and therefore generates samples that preserve the target distribution \(\pi \). In fact ARN and ASI are mathematically equivalent provided that the tuning parameter choices are made in a prescribed manner. To see this suppose that the tuning parameters of both the ARN and ASI schemes are fixed and share the same tuning parameter \(\eta \). The following theorem shows that their transition probabilities from \(\gamma \) to \(\gamma ^\prime \) are equal when \(\zeta = \xi \times \omega \) holds.

Theorem 1

Suppose that \(\eta \in \Delta _\epsilon ^{2p}\) and \(\zeta \), \(\xi \), \(\omega \in \Delta _\epsilon \) for small \(\epsilon \in (0, 1/2)\), and let \(p^{\mathrm {ARN}}_{(\xi \eta ,\omega )}\) and \(p^{\mathrm {ASI}}_{\zeta \eta }\) be the transition kernels of the ARN and ASI schemes respectively. If \(\zeta = \xi \times \omega \), then

$$\begin{aligned} p^{\mathrm {ARN}}_{(\xi \eta ,\omega )} (\gamma , \gamma ^\prime ) = p^{\mathrm {ASI}}_{\zeta \eta } (\gamma , \gamma ^\prime ) \end{aligned}$$
(19)

holds for any \(\gamma \) and \(\gamma ^\prime \in \Gamma \).

In addition we deduce the following corollary.

Corollary 1

Setting \(\xi _1 \times \omega _1 = \xi _2 \times \omega _2\) implies

$$\begin{aligned} p^{\mathrm {ARN}}_{(\xi _1\eta , \omega _1)} (\gamma , \gamma ^\prime ) = p^{\mathrm {ARN}}_{(\xi _2\eta , \omega _2)} (\gamma , \gamma ^\prime ) \end{aligned}$$

for any \(\gamma \) and \(\gamma ^\prime \in \Gamma \).

Corollary 1 shows that two ARN kernels with different tuning parameters coincide if the products of the neighbourhood scaling parameter \(\xi \) and the proposal thinning parameter \(\omega \) are equal. The corollary also shows that mass can be shifted between \(\xi \) and \(\omega \) without modifying the resulting transition kernel, as long as their product is preserved.

4 Adaptive random neighbourhood and informed samplers

It should be clear from the above discussion that both the locally informed proposals and ASI schemes can be viewed as random neighbourhood samplers, and that the former focuses on selecting good proposals within a neighbourhood, while the latter focuses on constructing neighbourhoods of models which are more likely to be accepted in the Metropolis-Hastings update. Our main methodological contribution is to design a random neighbourhood sampler for which both the neighbourhood construction and within-neighbourhood proposal are designed in an informed way. We therefore consider using an adaptive random neighbourhood approach to construct neighbourhoods, followed by a locally informed approach to select a proposal from this neighbourhood.

The advantages of combining the two schemes in this manner are worth highlighting. A key strength of ASI is that generating proposals is computationally cheap, but when components of the posterior distribution are highly correlated the assumption of independence embedded into the proposal generation can lead to overly ambitious moves that will be rejected. To combat this, the scaling parameter must be used to control the acceptance rate, but in the presence of high correlation this can lead to small moves and slow mixing. The locally informed sampler can cope well with high levels of correlation in the posterior distribution, but in high dimensions the (un-informed) neighbourhood will often either contain no sensible models or be so large that the cost of computing the posterior probabilities of all models within it becomes prohibitive. Combining the two schemes is therefore an attractive proposition, as an intelligent neighbourhood that is not too large can be constructed using ASI, and correlation can then be accounted for at the second stage by choosing the within-neighbourhood proposal using the locally informed approach.

We give the details of this adaptive random neighbourhood and informed sampler below, which we call the Adaptive Random Neighbourhood Informed (ARNI) sampler. After this we define the point-wise ARNI (PARNI) scheme, which enjoys the benefits of ARNI but with much lower computational cost.

4.1 Adaptive random neighbourhood informed algorithm

We first describe a general construction of random neighbourhood informed proposals. Suppose a random neighbourhood sampler is given with neighbourhood indicator variable \(k \in \mathcal {K}\), an update mapping \(\rho \) and a within-neighbourhood proposal kernel \(Q_k\). The variable k follows a conditional distribution \(p(k|\gamma )\), whereas the proposal \(Q_k\) produces a new state \(\gamma ^\prime \) within the neighbourhood \(N(\gamma , k)\) in an uninformed manner. We consider a class of random neighbourhood informed proposals \(Q_{g,k}\) with mass function

$$\begin{aligned} q_{g,k}(\gamma , \gamma ^\prime ) = {\left\{ \begin{array}{ll} \frac{g\left( \frac{\pi (\gamma ^\prime )p(\rho (k)|\gamma ^\prime )q_{\rho (k)}(\gamma ^\prime , \gamma )}{\pi (\gamma )p(k|\gamma )q_k(\gamma , \gamma ^\prime )}\right) q_k(\gamma , \gamma ^\prime )}{Z_{g, k}(\gamma )}, &{} \gamma ^\prime \in N(\gamma , k) \\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(20)

where \(g:[0,\infty )\rightarrow [0,\infty )\) is a continuous monotone weighting function, and \(Z_{g, k}(\gamma )\) is a normalising constant defined by

$$\begin{aligned}&Z_{g, k}(\gamma )\nonumber \\&\quad = \sum _{\gamma ^*\in N(\gamma , k)} g\left( \frac{\pi (\gamma ^*)p(\rho (k)|\gamma ^*)q_{\rho (k)}(\gamma ^*, \gamma )}{\pi (\gamma )p(k|\gamma )q_k(\gamma , \gamma ^*)}\right) q_k(\gamma , \gamma ^*). \end{aligned}$$
(21)

The generated new state \(\gamma ^\prime \) is accepted using the Metropolis-Hastings rule

$$\begin{aligned} \alpha _{g, k}(\gamma , \gamma ^\prime )&= \min \left\{ 1, \frac{\pi (\gamma ^\prime )p(\rho (k)|\gamma ^\prime )q_{g,\rho (k)}(\gamma ^\prime , \gamma )}{\pi (\gamma )p(k|\gamma )q_{g,k}(\gamma , \gamma ^\prime )}\right\} . \end{aligned}$$
(22)

The proposal collapses to the locally balanced proposal of Zanella (2020) when the neighbourhood is non-stochastic, the weighting function g is a balancing function that satisfies \(g(t) = tg(1/t)\) and the within-neighbourhood proposal is symmetric. In what follows, we combine the above random neighbourhood informed proposal with the ARN scheme and develop an Adaptive Random Neighbourhood Informed (ARNI) proposal that uses an informed proposal at the within-neighbourhood stage. In the ARNI scheme, the mapping \(\rho \) is chosen to be the identity function (i.e. \(\rho (k) = k\)) and the within-neighbourhood proposal in Algorithm 2 is replaced by

$$\begin{aligned} q^{\text {ARNI}}_{\theta , k}(\gamma , \gamma ^\prime )&\propto g\left( \frac{\pi (\gamma ^\prime )p^{\text {RN}}_{\xi \eta ^{\text {opt}}}(k|\gamma ^\prime )q^{\text {THIN}}_{\omega , k}(\gamma ^\prime , \gamma )}{\pi (\gamma )p^{\text {RN}}_{\xi \eta ^{\text {opt}}}(k|\gamma )q^{\text {THIN}}_{\omega , k}(\gamma , \gamma ^\prime )}\right) q^{\text {THIN}}_{\omega , k}(\gamma , \gamma ^\prime ) \nonumber \\&= g\left( \frac{\pi (\gamma ^\prime )p^{\text {RN}}_{\xi \eta ^{\text {opt}}}(k|\gamma ^\prime )}{\pi (\gamma )p^{\text {RN}}_{\xi \eta ^{\text {opt}}}(k|\gamma )}\right) q^{\text {THIN}}_{\omega , k}(\gamma , \gamma ^\prime ) \end{aligned}$$
(23)

for some weighting function g and some parameters \(\theta = (\xi \eta ^{\text {opt}}, \omega ) \in \Delta _\epsilon ^{2p+1} = (\epsilon ,1-\epsilon )^{2p+1}\). The last equation follows since the within-neighbourhood proposal \(q^{\text {THIN}}_{\omega , k}\) is symmetric and therefore \(q^{\text {THIN}}_{\omega , k}(\gamma ^\prime , \gamma )/q^{\text {THIN}}_{\omega , k}(\gamma , \gamma ^\prime ) = 1\) for all \(\gamma ^\prime \in N(\gamma , k)\). The Metropolis-Hastings acceptance probability is tailored to the new informed proposal as

$$\begin{aligned} \alpha _{\theta , k}^{\text {ARNI}}(\gamma , \gamma ^\prime ) = \min \left\{ 1, \frac{\pi (\gamma ^\prime )p^{\text {RN}}_{\xi \eta ^{\text {opt}}}(k|\gamma ^\prime )q^{\text {ARNI}}_{\theta , k}(\gamma ^\prime , \gamma )}{\pi (\gamma )p^{\text {RN}}_{\xi \eta ^{\text {opt}}}(k|\gamma )q^{\text {ARNI}}_{\theta , k}(\gamma , \gamma ^\prime )}\right\} . \end{aligned}$$
(24)

The optimal choice of informed weighting function is unclear in the ARNI scheme. The thresholding function is not appropriate since the neighbourhoods generated by ARNI cannot be divided into addition and deletion neighbourhoods as in the LIT scheme. We therefore recommend using a balancing function, which satisfies \(g(t) = tg(1/t)\), to form an ARNI balanced proposal.
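A sketch of the within-neighbourhood proposal (23) with a balancing function is given below (our own code); it enumerates all \(2^{p_k}\) models in \(N(\gamma , k)\), which is only feasible when \(p_k\) is small, and uses Hastings' choice \(g(t) = \min \{1,t\}\) by default. Here `log_pi` and `log_p_rn` are assumed callables returning \(\log \pi (\gamma )\) and \(\log p^{\text {RN}}_{\xi \eta ^{\text {opt}}}(k|\gamma )\).

```python
import itertools
import numpy as np

def arni_within_proposal(gamma, k, omega, log_pi, log_p_rn,
                         g=lambda t: np.minimum(1.0, t)):
    """Proposal probabilities (23) over N(gamma, k): balancing function g
    applied to the target ratio, times the thinned base kernel (16)."""
    free = np.flatnonzero(k)
    p_k = len(free)
    base = log_pi(gamma) + log_p_rn(k, gamma)
    models, weights = [], []
    for bits in itertools.product([0, 1], repeat=p_k):
        cand = gamma.copy()
        cand[free] = bits
        d = int(np.sum(cand != gamma))                        # Hamming distance
        log_ratio = log_pi(cand) + log_p_rn(k, cand) - base   # q^THIN ratio cancels
        q_thin = omega ** d * (1.0 - omega) ** (p_k - d)
        models.append(cand)
        weights.append(g(np.exp(log_ratio)) * q_thin)
    weights = np.asarray(weights, dtype=float)
    return models, weights / weights.sum()                    # divide by Z_{g,k}(gamma)
```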

To boost the convergence of these adaptive tuning parameters, the same multiple-chain strategy as in ASI should be implemented. In addition to the notation used for ASI, \(k^{l,(i)}\) denotes the neighbourhood indicator variable for the l-th chain at iteration i. For L multiple chains, the tuning parameters \(\eta ^{\text {opt}}\) are updated following the same scheme as ASI in (5) and (6). The two scaling parameters \(\xi \) and \(\omega \) can be updated using the Robbins-Monro schemes

$$\begin{aligned} \text {logit}_\epsilon \xi ^{(i+1)}&= \text {logit}_\epsilon \xi ^{(i)} + \frac{\phi _i}{L} \sum _{l=1}^L (p_{k^{l,(i)}} - s) \end{aligned}$$
(25)
$$\begin{aligned} \text {logit}_\epsilon \omega ^{(i+1)}&= \text {logit}_\epsilon \omega ^{(i)} + \frac{\phi _i}{L} \sum _{l=1}^L (\alpha ^l_i - \tau ) \end{aligned}$$
(26)

where \(p_k\) is the size of k as mentioned previously, s is the target size of k, \(\alpha ^l_i\) is the acceptance probability at the ith iteration for the l-th chain and \(\tau \) is the target average acceptance rate.

Remark 6

For practical convenience, it is often useful to choose the diminishing sequence \(\phi _i\) of the form \(\phi _i = i^{-\lambda }\) for \(\lambda \in (1/2, 1)\), since the condition \(\phi _i = \mathcal {O}(i^{-\lambda })\) is not violated by this choice. Choosing \(\lambda > 1\) would result in finite adaptation (Roberts and Rosenthal 2007), in which the adaptation stops after a finite stopping time, and using \(\lambda < 1/2\) is uncommon because of finite-sample stability concerns. We therefore recommend using \(\phi _i = i^{-0.7}\) for both updating schemes. See Remark 3 in Griffin et al. (2021) for further discussion.

While the informed proposal is powerful in accelerating the convergence of the chains, it also introduces extra computational costs since the posterior probabilities of all models in a neighbourhood are required. Given a k of size \(p_k\), the resulting neighbourhood \(N(\gamma , k)\) consists of \(2^{p_k}\) models. Although it is possible to speed up the posterior calculations using Gray codes as introduced in George and McCulloch (1997), evaluating \(2^{p_k}\) models is still computationally expensive when \(p_k\) is very large and leads to an inefficient scheme. One way to address the issue is to tune the neighbourhood scaling parameter \(\xi \) to generate neighbourhoods of a desired size, say \(s = 5\). In our experience, such control of the size of k comes at the cost of reduced exploration of the model space, and the ARNI scheme then fails to achieve better performance than ASI. This motivated us to develop a more efficient implementation of this approach that controls computational cost but maintains good exploration properties.

4.2 The PARNI sampler

We consider a point-wise implementation of the ARNI scheme (for short, the PARNI scheme). This approach is motivated by the block-wise implementation in Zanella (2020) and the block design strategy in Titsias and Yau (2017). The main idea is that a large neighbourhood is divided into a series of smaller blocks and the new model is proposed by sequentially adding or deleting variables in each block. The block design can lead to a significant reduction in the total number of models considered and so requires less computational effort. For instance, suppose that there are \(p_k\) non-zero neighbourhood indicator variables, which are divided into \(p_k/m\) blocks each containing m variables. The neighbourhood generated by each block then contains \(2^m\) models, so working through every block to propose a new state requires evaluating \(2^m p_k/m\) posterior probabilities. As the computational cost is proportional to the total number of models considered, the cost is largest when \(m = p_k\), where the only block is the entire neighbourhood \(N(\gamma , k)\) and \(2^{p_k}\) models must be evaluated. In contrast, the smallest cost occurs when \(m=1\), where each block contains one variable and therefore only two models, giving \(2p_k\) evaluations in total (for example, with \(p_k = 12\) this is 24 models rather than 4096). Throughout the section, we consider the latter block design with \(m=1\); the resulting algorithm is the PARNI sampler.

4.2.1 Main algorithm

We now formally present the PARNI algorithm and show how a new model \(\gamma ^\prime \) is proposed from the current model \(\gamma \). We use the same random neighbourhood construction as the ARNI scheme; in addition, the neighbourhood scaling parameter \(\xi \) is fixed at 1, so that neighbourhood sizes are not reduced at this stage. In other words, the neighbourhoods are generated with the optimal values \(\eta ^{\text {opt}}\) as in (4). After a neighbourhood \(N(\gamma , k)\) is sampled, we sequentially propose new models inside \(N(\gamma , k)\), each differing from the previous one by a Hamming distance of at most 1. We define \(K = \{K_1, \ldots , K_{p_k}\} = \{j|k_j = 1\}\) to be the set of variables for which \(k_j=1\) (the order of the variables is random). We also define a sequence of models, \(\gamma (1), \ldots , \gamma (p_k)\), and neighbourhoods, \(N(1),\ldots , N(p_k)\), used to sample the final proposal \(\gamma ^\prime \). To introduce more flexibility, we allow different weighting functions for each sub-proposal, so \(p_k\) weighting functions \(g_1, \ldots , g_{p_k}\) are defined. Finally, let \(e(1),\ldots ,e(p)\) be the basis vectors of a p-dimensional Cartesian space, where \(e(j)_j = 1\) and \(e(j)_{j^\prime } = 0\) whenever \(j^\prime \ne j\). We consider the neighbourhoods constructed according to \(\gamma (r-1)\) and \(e(K_r)\) for r from 1 to \(p_k\). The first neighbourhood is \(N(1)=N(\gamma , e(K_1))\), from which we propose a model \(\gamma (1)\) according to

$$\begin{aligned} q^{\text {PARNI}}_{\theta ,K_1}(\gamma , \gamma (1)) \propto {\left\{ \begin{array}{ll} g_1\left( \frac{\pi (\gamma (1))p^{\text {RN}}_{\eta ^{\text {opt}}}(e(K_1)|\gamma (1))}{\pi (\gamma )p^{\text {RN}}_{\eta ^{\text {opt}}}(e(K_1)|\gamma )}\right) q^{\text {THIN}}_{\omega , e(K_1)}(\gamma , \gamma (1)), \quad &{} \text { if }\gamma (1) \in N(1) \\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(27)

for some algorithmic parameters \(\theta = (\eta ^{\text {opt}}, \omega ) \in \Delta _\epsilon ^{2p+1}\). We repeat this process to construct the second neighbourhood \(N(2) = N(\gamma (1), e(K_2))\) and propose the model \(\gamma (2)\) from N(2). In general, at step r, we define \(N(r) = N(\gamma (r-1), e(K_r))\) and propose a model \(\gamma (r)\) from

$$\begin{aligned} q^{\text {PARNI}}_{\theta ,K_r}(\gamma (r-1), \gamma (r)) \propto {\left\{ \begin{array}{ll} g_r\left( \frac{\pi (\gamma (r))p^{\text {RN}}_{\eta ^{\text {opt}}}(e(K_r)|\gamma (r))}{\pi (\gamma (r-1))p^{\text {RN}}_{\eta ^{\text {opt}}}(e(K_r)|\gamma (r-1))}\right) q^{\text {THIN}}_{\omega , e(K_r)}(\gamma (r-1), \gamma (r)), \quad &{} \text { if }\gamma (r) \in N(r) \\ 0, &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(28)

Each sub-proposal above only allows the value in position \(K_r\) to change. Figure 1 provides a flowchart of the PARNI scheme, which only involves enumerating at most \(2p_k\) models rather than the \(2^{p_k}\) models of the ARNI proposal. The parameters of the proposal are \(\theta = (\eta ^{\text {opt}}, \omega )\).
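The forward pass of the PARNI proposal can be sketched as follows (our own code, with \(\xi = 1\) as described above); `log_pi` is a hypothetical callable returning \(\log \pi (\gamma )\), and the \(p^{\text {RN}}\) factors at positions other than \(K_r\) cancel in the ratio because consecutive models agree there.

```python
import numpy as np

def parni_propose(gamma, k, A_opt, D_opt, omega, log_pi, rng,
                  g=lambda t: np.minimum(1.0, t)):
    """Sequential point-wise proposal (27)-(28); returns the proposed model
    and the log of the forward proposal probability (30)."""
    K = rng.permutation(np.flatnonzero(k))     # random ordering of flagged positions
    cur = gamma.copy()
    log_q_forward = 0.0
    for j in K:
        flip = cur.copy()
        flip[j] = 1 - flip[j]
        # ratio pi(flip) p^RN(e(j)|flip) / (pi(cur) p^RN(e(j)|cur)); only the
        # factor at position j survives since cur and flip agree elsewhere
        log_prn_cur = np.log(A_opt[j] if cur[j] == 0 else D_opt[j])
        log_prn_flip = np.log(A_opt[j] if flip[j] == 0 else D_opt[j])
        log_ratio = log_pi(flip) + log_prn_flip - log_pi(cur) - log_prn_cur
        w_stay = g(1.0) * (1.0 - omega)        # q^THIN keeps position j unchanged
        w_flip = g(np.exp(log_ratio)) * omega  # q^THIN flips position j
        prob_flip = w_flip / (w_stay + w_flip)
        if rng.random() < prob_flip:
            cur, step_log_prob = flip, np.log(prob_flip)
        else:
            step_log_prob = np.log(1.0 - prob_flip)
        log_q_forward += step_log_prob
    return cur, log_q_forward
```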

To construct a \(\pi \)-reversible chain, the probability of the reverse move is required. The reverse move uses \(K^\prime = \rho (K)\) as its auxiliary variable, where the mapping \(\rho \) reverses the order of the elements in K, so that \(K^\prime \) contains the same elements as K but in reverse order. The first benefit of this choice is that it leads to identical intermediate models in the forward and reverse proposals, so that the posterior probabilities of \(p_k\) models are required instead of \(2 p_k\). Suppose that \(\gamma ^\prime (r)\) for \(r = 0 ,\ldots , p_k\) are the consecutive intermediate models used in the reverse move and \(N^\prime (r)\) for \(r = 1 ,\ldots , p_k\) are the neighbourhoods used in the reverse move. These models and neighbourhoods are identical to those used in the forward move but in the opposite order; in particular, \(\gamma ^{\prime }(r) = \gamma (p_k-r)\) for \(r = 0 ,\ldots , p_k\) and \(N^\prime (r) = N(p_k-r+1)\) for \(r = 1 ,\ldots , p_k\). The second benefit is that the design leads to a simpler form of the Metropolis-Hastings acceptance probability. Let Z(r) be the normalising constant of the r-th sub-proposal, \(q^{\text {PARNI}}_{\theta ,K_r}(\gamma (r-1), \gamma (r))\), and \(Z^\prime (r)\) the normalising constant of the r-th sub-proposal in the reverse move, \(q^{\text {PARNI}}_{\theta ,K^\prime _r}(\gamma ^{\prime }(r-1), \gamma ^{\prime }(r))\), with weighting functions \(g^\prime _r\). We have that

Fig. 1 Flowcharts of the point-wise implementation of the adaptive random neighbourhood informed proposal in one iteration. Top panel: proposed direction. Bottom panel: reversed direction. The black neighbourhoods \(N(\gamma , k)\) and \(N(\gamma ^\prime , k)\) are the original large neighbourhoods. The red neighbourhoods N(r) and \(N^\prime (r)\) are the subsequent small neighbourhoods used for each intermediate proposal. The orange model \(\gamma \) is the current state and the cerise model \(\gamma ^\prime \) is the final proposal. The blue models \(\gamma (r)\) and \(\gamma ^{\prime }(r)\) are intermediate models. The light blue arrows indicate the position-wise proposals. (Color figure online)

$$\begin{aligned} Z(r) =&\sum _{\gamma ^* \in N(r)} g_r\left( \frac{\pi (\gamma ^*)p^{\text {RN}}_{\eta ^{\text {opt}}}(e(K_r)|\gamma ^*)}{\pi (\gamma (r-1))p^{\text {RN}}_{\eta ^{\text {opt}}}(e(K_r)|\gamma (r-1))}\right) \\&\quad q^{\text {THIN}}_{\omega , e(K_r)}(\gamma (r-1), \gamma ^*) \\ Z^\prime (r) =&\sum _{\gamma ^* \in N^\prime (r)} g^\prime _r\left( \frac{\pi (\gamma ^*)p^{\text {RN}}_{\eta ^{\text {opt}}}(e(K^\prime _r)|\gamma ^*)}{\pi (\gamma ^\prime (r-1))p^{\text {RN}}_{\eta ^{\text {opt}}}(e(K^\prime _r)|\gamma ^\prime (r-1))}\right) \\&\quad q^{\text {THIN}}_{\omega , e(K^\prime _r)}(\gamma ^\prime (r-1), \gamma ^*). \end{aligned}$$
(29)

We let \(q^{\text {PARNI}}_{\theta , k}(\gamma , \gamma ^\prime )\) denote the full proposal kernel, which satisfies

$$\begin{aligned} q^{\text {PARNI}}_{\theta , k}(\gamma , \gamma ^\prime ) = \prod _{r=1}^{p_k} q^{\text {PARNI}}_{\theta , K_r}(\gamma (r-1), \gamma (r)) \end{aligned}$$
(30)

where \(\gamma (0)\) is current state \(\gamma \) and \(\gamma (p_k)\) is the final proposal \(\gamma ^\prime \). The Metropolis-Hastings acceptance probability of the PARNI proposal is given as

$$\begin{aligned} \alpha ^{\text {PARNI}}_{\theta , k}(\gamma , \gamma ^\prime ) = \min \left\{ 1, \frac{\pi (\gamma ^\prime )p^{\text {RN}}_{\eta ^{\text {opt}}}(k|\gamma ^\prime )q^{\text {PARNI}}_{\theta , k}(\gamma ^\prime , \gamma )}{\pi (\gamma )p^{\text {RN}}_{\eta ^{\text {opt}}}(k|\gamma )q^{\text {PARNI}}_{\theta , k}(\gamma , \gamma ^\prime )} \right\} . \end{aligned}$$
(31)

In specifying the weighting functions \(g_r\) for \(r = 1, \ldots , p_{k}\), note that each sub-proposal in the PARNI scheme can be treated as an addition/deletion move, so it is feasible to choose thresholding functions, as in the LIT scheme of Zhou et al. (2021), that depend on the type of move. We consider the following thresholded weighting function

$$\begin{aligned} g_r(t) = {\left\{ \begin{array}{ll} \min \{\max \{p^{-1}, t\}, p\}, &{}\text { if }\gamma (r)_{K_r} = 0\\ \min \{\max \{p^{-1}, t\}, 1\}, &{}\text { if }\gamma (r)_{K_r} = 1 \end{array}\right. } \end{aligned}$$
(32)

for \(r = 1, \ldots , p_{k}\). The weighting functions in the reverse move are defined similarly. Alternatively, we can also use a balancing function g in PARNI. The choice of balancing function mainly focuses on three particular candidates: the square root function \(g_{\text {sq}} (t) = \sqrt{t}\), Hastings’ choice \(g_{\text {H}}(t)=\min \{1,t\}\) and Barker’s choice \(g_{\text {B}}(t) = t/(1+t)\). The comparisons of these balancing functions in Supplement B.1.3 of Zanella (2020) illustrate two major findings: Hastings’ and Barker’s choices differ by at most a factor of 2 owing to their similar asymptotic behaviour, while the square root function mixes the worst outside the burn-in phase. We therefore use Hastings’ choice throughout the rest of the paper, that is,

$$\begin{aligned} g_r(t) = \min \{1,t\} \end{aligned}$$
(33)

for all \(r = 1, \ldots , p_{k}\). Similar results are also expected for Barker’s choice. Using a balancing function leads to a simpler form of the Metropolis-Hastings acceptance probability, as the following proposition illustrates:

Proposition 3

Suppose \(\gamma \), \(\gamma ^\prime \in \Gamma \) are fixed. For any \(\theta = (\eta , \omega ) \in \Delta _\epsilon ^{2p+1}\) and k such that \(\gamma ^\prime \in N(\gamma , k)\), if the weighting function \(g_r\) satisfies \(g_r(t)=tg_r(1/t)\) for all r, then the Metropolis-Hastings acceptance probability in (31) can be written as

$$\begin{aligned} \alpha ^{\text {PARNI}}_{\theta , k}(\gamma , \gamma ^\prime ) = \min \left\{ 1, \prod _{r=1}^{p_k} \frac{Z(r)}{Z^\prime (r)} \right\} \end{aligned}$$
(34)

where Z(r), \(Z^\prime (r)\) for \(r = 1, \ldots , p_k\) are the normalising constants given in (29).
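Proposition 3 means that, once the forward and reverse normalising constants are available, the accept/reject step reduces to a single product of ratios. A minimal sketch of this computation, assuming the Z(r) and \(Z^\prime (r)\) values have already been computed as in (29) (the function name is ours):

```python
import math

def parni_acceptance_prob(Z, Z_rev):
    """Acceptance probability (34): min{1, prod_r Z(r) / Z'(r)}.

    Z and Z_rev are sequences holding Z(r) and Z'(r) for r = 1, ..., p_k.
    The product is accumulated on the log scale to avoid numerical
    under/overflow when p_k is large.
    """
    log_ratio = sum(math.log(z) - math.log(z_rev) for z, z_rev in zip(Z, Z_rev))
    return min(1.0, math.exp(log_ratio))
```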

The PARNI proposal that uses the thresholding function is referred to as PARNIT, whereas the one that uses the balancing function is referred to as PARNIB.
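For concreteness, the two weighting choices can be written as short functions. The sketch below takes (32) and (33) at face value; the argument t is the posterior ratio appearing in (29), `gamma_r_Kr` is the value of the \(K_r\)-th coordinate of the model being weighted, and the function names are ours.

```python
def g_thresholded(t, gamma_r_Kr, p):
    """Thresholded weighting function (32), used by PARNIT.

    Following the equation literally: the value is floored at 1/p in both
    branches, and capped at p when the K_r-th coordinate equals 0 and at 1
    when it equals 1.
    """
    cap = float(p) if gamma_r_Kr == 0 else 1.0
    return min(max(1.0 / p, t), cap)


def g_hastings(t):
    """Hastings' balancing function (33), used by PARNIB: g(t) = min{1, t}."""
    return min(1.0, t)


def g_barker(t):
    """Barker's balancing function: g(t) = t / (1 + t)."""
    return t / (1.0 + t)
```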

4.2.2 Adaptation schemes for algorithmic parameters

The last building block needed to complete the PARNI sampler is the adaptation mechanism for the tuning parameters. The posterior inclusion probabilities \(\pi _j\) are updated as in the ASI scheme in (5). The magnitude of the proposal thinning parameter \(\omega \) is crucial to the mixing time and convergence rate of the chains. We therefore consider two adaptation schemes for updating \(\omega \): the Robbins-Monro adaptation scheme (RM) and the Kiefer–Wolfowitz adaptation scheme (KW). For the rest of this section, we assume that L multiple chains are used for the PARNI sampler.

The Robbins-Monro adaptation scheme is widely used for updating tuning parameters of adaptive MCMC algorithms. Andrieu and Thoms (2008) review several adaptive MCMC algorithms that use variants of the Robbins-Monro process. Given a target probability of acceptance \(\tau \), the Robbins-Monro adaptation scheme automatically adjusts \(\omega \) by comparing the current probability of acceptance with \(\tau \). It is generally considered to be a robust adaptation scheme. Given the acceptance probability of the l-th chain at the i-th iteration, \(\alpha ^l_i\), the tuning parameter \(\omega \) is updated through the law

$$\begin{aligned} \text {logit}_\epsilon \omega ^{(i+1)} = \text {logit}_\epsilon \omega ^{(i)} + \frac{\phi _i}{L}\sum _{l=1}^L(\alpha ^l_i - \tau ). \end{aligned}$$
(35)

Here \(\phi _i = O(i^{-\lambda })\) for some constant \(1/2< \lambda < 1\). The theoretically optimal value of \(\tau \) may not exist for every candidate proposal kernel and choice of posterior distribution. Based on a large number of experiments, illustrated in Sect. 5, we recommend using the diminishing sequence \(\phi _i = i^{-0.7}\) and a target acceptance rate of \(\tau = 0.65\).
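A minimal sketch of the update (35) follows, assuming the constrained logit \(\text {logit}_\epsilon \) maps \((\epsilon , 1-\epsilon )\) to the real line via \(\omega \mapsto \log \{(\omega -\epsilon )/(1-\epsilon -\omega )\}\); this transformation, the helper names and the numerical values are our assumptions rather than the paper's exact implementation.

```python
import math

EPS = 0.001    # assumed value of the epsilon defining the interval (eps, 1 - eps)
TAU = 0.65     # target acceptance rate recommended in the text

def logit_eps(w, eps=EPS):
    # One natural choice of constrained logit mapping (eps, 1 - eps) onto the reals.
    return math.log((w - eps) / (1.0 - eps - w))

def inv_logit_eps(x, eps=EPS):
    # Inverse mapping back into (eps, 1 - eps).
    return eps + (1.0 - 2.0 * eps) / (1.0 + math.exp(-x))

def rm_update(omega, accept_probs, i, lam=0.7, tau=TAU):
    """Robbins-Monro update (35): nudge omega towards the target acceptance rate tau.

    accept_probs holds the acceptance probabilities alpha^l_i of the L chains at
    iteration i (i >= 1); phi_i = i^{-lam} is the diminishing step size.
    """
    phi_i = i ** (-lam)
    L = len(accept_probs)
    x = logit_eps(omega) + (phi_i / L) * sum(a - tau for a in accept_probs)
    return inv_logit_eps(x)
```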

Apart from the Robbins-Monro scheme above, the Kiefer–Wolfowitz scheme is another possible adaptation for tuning \(\omega \) in the PARNI sampler. The Kiefer–Wolfowitz scheme is a stochastic approximation algorithm that modifies the Robbins-Monro scheme by replacing the derivative with a finite difference approximation. In this scheme the tuning parameter is updated to target the optimiser of an objective function of interest. Following Pasarica and Gelman (2010), one can use the expected squared jumping distance as the objective function because it is closely linked to the mixing and convergence properties of a Markov chain. The expected squared jumping distance can be estimated by the average squared jumping distance. An alternative objective function is the generalised speed measure introduced in Titsias and Dellaportas (2019).

To construct the finite difference approximation to the derivative of the average squared jumping distance, we exploit the multiple chain implementation of PARNI. The multiple independent chains naturally provide independent samples, which suits the Kiefer–Wolfowitz approximation. Our implementation of the Kiefer–Wolfowitz adaptation scheme proceeds as follows. We first divide the L chains into two equally sized batches, \(L^+\) and \(L^-\). Let \(c_i\) be a diminishing sequence; new proposals are generated using \(\omega ^+ = \omega ^{(i)} + c_i\) for chains in \(L^+\) and \(\omega ^- = \omega ^{(i)} - c_i\) for chains in \(L^-\). The average squared jumping distances for these batches (i.e. \(\text {ASJD}^{+,(i)}\) and \(\text {ASJD}^{-,(i)}\)) are estimated using the new proposals and their corresponding probabilities of acceptance. The tuning parameter \(\omega \) is then updated according to the rule

$$\begin{aligned} \text {logit}_\epsilon \omega ^{(i+1)} = \text {logit}_\epsilon \omega ^{(i)} + a_i \left( \frac{\text {ASJD}^{+,(i)} - \text {ASJD}^{-,(i)}}{2c_i}\right) . \end{aligned}$$
(36)

We suggest using \(a_i = i^{-1}\) and \(c_i = i^{-0.5}\) in the Kiefer–Wolfowitz scheme. Further details of the Kiefer–Wolfowitz adaptation scheme are given in Section A.1 of the supplementary material, and a feasibility analysis of the scheme is carried out in Section C.2 of the supplementary material.
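The corresponding update (36) can be sketched in the same way, reusing `logit_eps` and `inv_logit_eps` from the Robbins-Monro sketch above; how exactly the perturbed values \(\omega \pm c_i\) and the batch ASJD estimates are formed is deferred to the supplementary material, so the snippet only illustrates the update rule itself.

```python
def kw_update(omega, asjd_plus, asjd_minus, i):
    """Kiefer-Wolfowitz update (36) with the suggested sequences a_i = i^{-1}, c_i = i^{-0.5}.

    asjd_plus and asjd_minus are the average squared jumping distances of the
    chains run with omega + c_i and omega - c_i respectively (i >= 1).
    """
    a_i = 1.0 / i
    c_i = i ** (-0.5)
    grad_est = (asjd_plus - asjd_minus) / (2.0 * c_i)  # finite-difference gradient estimate
    return inv_logit_eps(logit_eps(omega) + a_i * grad_est)
```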

Remark 7

Blum (1954) shows that the Kiefer–Wolfowitz scheme converges if the diminishing sequences \(a_i\) and \(c_i\) satisfy \(\sum _{i=0}^\infty a_i^2 c_i^{-2} = \infty \). According to Remark 6, the sequences \(a_i\) and \(c_i\) should have diminishing rates between \(-0.5\) and \(-1\). Therefore, the only possible pair is \(a_i = i^{-1}\) and \(c_i = i^{-0.5}\).

Remark 8

As an alternative to adapting the thinning parameter \(\omega \) through the above adaptation schemes, one can simply fix \(\omega \) at 1/2, in which case the base kernel \(q^{\text {THIN}}\) becomes uniform. Note that fixing \(\omega \) at 1/2 does not necessarily lead to optimal mixing for the PARNI scheme.

Pseudocode for the PARNI samplers is given in Algorithm 3. The corresponding transition kernel is referred to as \(p_{\theta }^{\text {PARNI}(*)-\bullet }\) for \(* = \text {T}\) or \(\text {B}\) and \(\bullet = \text {RM}\) or \(\text {KW}\). In the next section we show that the PARNI sampler is \(\pi \)-ergodic and satisfies a strong law of large numbers.

Algorithm 3 (pseudocode of the PARNI samplers)

4.2.3 Ergodicity and strong law of large numbers

The multiple chain acceleration can be thought of as the realisation of L runs on the product space \(\Gamma ^{\otimes L}\) with joint variable \(\gamma ^{\otimes L} = (\gamma ^{1}, \ldots , \gamma ^{L}) \in \Gamma ^{\otimes L}\). Without loss of generality, suppose further that \(L \ge 1\) for the Robbins-Monro adaptation scheme and \(L \ge 2\) for the Kiefer–Wolfowitz adaptation scheme. We consider a posterior distribution \(\pi \) on the space \(\Gamma \) of the form

$$\begin{aligned} \pi (\gamma ) \propto p(y|\gamma ) p(\gamma ) \end{aligned}$$
(37)

where both \(p(y|\gamma )\) and \(p(\gamma )\) are analytically available. In addition, the joint posterior distribution \(\pi ^{\otimes L}\) on the product set \(\Gamma ^{\otimes L}\) is given as

$$\begin{aligned} \pi ^{\otimes L}(\gamma ^{\otimes L}) = \prod _{l=1}^L \pi (\gamma ^{l}). \end{aligned}$$
(38)

In this section, the symbol \(*\) denotes either T or B and the symbol \(\bullet \) represents either KW or RM. The sub-proposal mass function of the \(\text {PARNI}(*)-\bullet \) sampler given neighbourhood indicator variable k and tuning parameter \(\theta = (\eta , \omega )\) is defined by

$$\begin{aligned} \psi ^{\text {PARNI}(*)-\bullet }_{\theta , k}(\gamma , \gamma ^\prime )= p^{\text {RN}}_\eta (k|\gamma )q^{\text {PARNI}(*)-\bullet }_{\theta , k}(\gamma , \gamma ^\prime ). \end{aligned}$$
(39)

The full transition kernel of the PARNI sampler is obtained by marginalising over all possible k,

$$\begin{aligned} P^{\text {PARNI}(*)-\bullet }_{\theta }(\gamma , S) = \sum _{k \in \mathcal {K}} P^{\text {PARNI}(*)-\bullet }_{\theta , k}(\gamma , S) \end{aligned}$$
(40)

where the sub-transition kernels given k are

$$\begin{aligned} P^{\text {PARNI}(*)-\bullet }_{(\theta , k)}(\gamma , S)&= \sum _{\gamma ^\prime \in S} p^{\text {PARNI}(*)-\bullet }_{\theta , k}(\gamma , \gamma ^\prime ) \nonumber \\&= \sum _{\gamma ^\prime \in S} \psi ^{\text {PARNI}(*)-\bullet }_{\theta , k}(\gamma , \gamma ^\prime ) \alpha ^{\text {PARNI}}_{\theta , k}(\gamma , \gamma ^\prime ) \nonumber \\&\quad + \mathbb {I}\{\gamma \in S\} \sum _{\gamma ^\prime \in \Gamma } \psi ^{\text {PARNI}(*)-\bullet }_{\theta , k}(\gamma , \gamma ^\prime )(1- \alpha ^{\text {PARNI}}_{\theta , k}(\gamma , \gamma ^\prime )) \end{aligned}$$
(41)

and \(\alpha ^{\text {PARNI}}_{\theta , k}\) is the Metropolis-Hastings acceptance probability in (31). The Markov chain transition kernel acting on the product space \(\Gamma ^{\otimes L}\) is given by

$$\begin{aligned} P^{\text {PARNI}(*)-\bullet }_{(\theta , k^{\otimes L})}(\gamma ^{\otimes L}, S^{\otimes L}) = \prod _{l=1}^L P^{\text {PARNI}(*)-\bullet }_{(\theta , k^{l})}(\gamma ^{l}, S^{l}). \end{aligned}$$
(42)

To establish ergodicity and a strong law of large numbers (SLLN) for the PARNI sampler and its multiple chain acceleration, we require the following assumptions:

  1. (A.1)

The weighting function \(g:\mathbb {R}^+ \rightarrow \mathbb {R}^+\) is \(C_g\)-Lipschitz. That is, there exists a constant \(C_g\) such that for any \(t_2> t_1 > 0\) the weighting function g satisfies

    $$\begin{aligned} |g(t_2) - g(t_1)| \le C_g |t_2 - t_1|. \end{aligned}$$
    (43)

The thresholding function of LIT clearly satisfies this assumption. This is also a common condition satisfied by standard choices of balancing function. For example, Hastings’ choice \(g_\text {H}(t) = \min \{1,t\}\) satisfies (43) with \(C_g = 1\), and Barker’s choice \(g_\text {B}(t) = t/(1+t)\) also satisfies (43) with \(C_g = 1\) (its maximum derivative). A numerical sanity check of these constants is sketched after this list of assumptions.

  2. (A.1.a)

Let c be a small positive real number that is a universal lower bound on the ratio

    $$\begin{aligned} \frac{\pi (\gamma ^\prime )p_\eta ^{\text {RN}}(k|\gamma ^\prime )}{\pi (\gamma )p_\eta ^{\text {RN}}(k|\gamma )} \end{aligned}$$
    (44)

for all \(\gamma , \gamma ^\prime \in \Gamma \), \(k \in \mathcal {K}\) and \(\eta \in \Delta ^p_\epsilon = (\epsilon ,1-\epsilon )^p\). Then the weighting function \(g:(c,\infty ) \rightarrow (c,\infty )\) is \(C_g\)-Lipschitz; that is, there exists a constant \(C_g\) such that for any \(t_2> t_1> c > 0\) the weighting function g satisfies

    $$\begin{aligned} |g(t_2) - g(t_1)| \le C_g |t_2 - t_1|. \end{aligned}$$
    (45)

The square root function \(g_{\text {sq}}(t) = \sqrt{t}\) satisfies this condition with \(C_g = c^{-1/2}/2\).

  3. (A.2)

    The posterior distribution \(\pi \) is everywhere positive and bounded, that is, there exists a positive \(\Pi \in (1, \infty )\) such that

    $$\begin{aligned} \frac{1}{\Pi } \le \frac{\pi (\gamma ^\prime )}{\pi (\gamma )} \le \Pi \end{aligned}$$

    for all \(\gamma \), \(\gamma ^\prime \in \Gamma \).

  4. (A.3)

Recall the set \(\Delta _\epsilon ^{2p+1} = (\epsilon , 1-\epsilon )^{2p+1}\). The tuning parameters \(\theta ^{(i)} = (\eta ^{(i)}, \omega ^{(i)})\) are bounded away from 0 and 1, lying in this set,

    $$\begin{aligned} \theta ^{(i)} \in \Delta _\epsilon ^{2p+1} \end{aligned}$$
    (46)

    for some small \(\epsilon \in (0, 1/2)\).
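As a quick numerical illustration of (A.1) and (A.1.a) (referenced in the discussion of (A.1) above), one can check the Lipschitz bounds on randomly drawn pairs. This is a sanity check only, not a proof, and the value used for the lower bound c of (44) is an arbitrary assumption.

```python
import math
import random

def is_lipschitz(g, C, n_pairs=10_000, lower=0.0, width=10.0):
    """Check |g(t2) - g(t1)| <= C |t2 - t1| on random pairs in (lower, lower + width)."""
    for _ in range(n_pairs):
        t1 = lower + random.random() * width
        t2 = lower + random.random() * width
        if abs(g(t2) - g(t1)) > C * abs(t2 - t1) + 1e-12:
            return False
    return True

print(is_lipschitz(lambda t: min(1.0, t), C=1.0))              # Hastings' choice, C_g = 1
print(is_lipschitz(lambda t: t / (1.0 + t), C=1.0))            # Barker's choice, C_g = 1
c = 0.1                                                        # assumed lower bound from (44)
print(is_lipschitz(math.sqrt, C=0.5 / math.sqrt(c), lower=c))  # square root on (c, infinity)
```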

The analysis of convergence and ergodicity relies on the distribution of the Markov chain at time i and the associated total variation distance \(\Vert \cdot \Vert _{TV}\) from an arbitrary starting point. Given \(\{\gamma ^{l,(i)}\}_{i=0}^\infty \), these are defined as

$$\begin{aligned}&\mathcal {L}^{l,(i)} [(\gamma ^{l}, \theta ), S] := \Pr \left[ \gamma ^{l, (i)} \in S| \gamma ^{l, (0)} = \gamma ^{l}, \theta ^{0} = \theta \right] , \end{aligned}$$
(47)
$$\begin{aligned}&T^l(\gamma ^l, \theta , i) := \Vert \mathcal {L}^{l,(i)} [(\gamma ^{l}, \theta ), \cdot ] - \pi (\cdot ) \Vert _{TV}. \end{aligned}$$
(48)

We show here that the PARNI sampler is ergodic and satisfies a strong law of large numbers (SLLN). In mathematical terms, for any starting point \(\gamma ^{\otimes L} \in \Gamma ^{\otimes L}\) and \(\theta \in \Delta _\epsilon ^{2p+1}\), ergodicity means that

$$\begin{aligned} \lim _{i \rightarrow \infty } T^l(\gamma ^l, \theta , i) = 0 \end{aligned}$$
(49)

for any \(l = 1, \ldots , L\), while a strong law of large numbers (SLLN) implies that

$$\begin{aligned} \frac{1}{NL} \sum _{i=0}^{N-1} \sum _{l=1}^L f(\gamma ^{l,(i)}) \rightarrow \pi (f) \end{aligned}$$
(50)

almost surely, for any \(f:\Gamma \rightarrow \mathbb {R}\). We first establish two technical results before presenting the main theorem of this section.
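Before turning to those results, note that in practice the SLLN (50) is what justifies estimating posterior quantities by pooling ergodic averages over the L chains. A minimal sketch, assuming `chains` is a list of L equally long trajectories of model vectors and `f` is any real-valued function on \(\Gamma \):

```python
def pooled_ergodic_average(chains, f):
    """Pooled ergodic average (50): average f over all iterations of all L chains."""
    total = sum(f(gamma) for chain in chains for gamma in chain)
    n_samples = sum(len(chain) for chain in chains)
    return total / n_samples

# Example: the marginal posterior inclusion probability of covariate j can be
# estimated by averaging the indicator gamma[j] over the pooled samples:
# pip_j = pooled_ergodic_average(chains, lambda gamma: gamma[j])
```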

Lemma 1

(Simultaneous Uniform Ergodicity) The MCMC transition kernel \(P_\theta ^{\text {PARNI}(*)-\bullet }\) in (40) with target distribution \(\pi \) in (37) is simultaneously uniformly ergodic for any choice of \(\epsilon \in (0,1/2)\) in (46). That is, for any \(\delta >0\) there exists \(N = N(\delta , \epsilon )\) such that

$$\begin{aligned} \left\| \left( P^{\text {PARNI}(*)-\bullet }_\theta (\gamma ^{\otimes L},\cdot )\right) ^N - \pi ^{\otimes L}(\cdot )\right\| _{TV} \le \delta \end{aligned}$$

holds for any starting point \(\gamma ^{\otimes L} \in \Gamma ^{\otimes L}\) and any value \(\theta \in \Delta _\epsilon ^{2p+1}\).

Lemma 2

(Diminishing adaptation) Let the adaptation rate constant \(\lambda \) lie in (1/2, 1) for \(\bullet = \text {RM}\) and equal 1/2 for \(\bullet = \text {KW}\). Then, for any \(\epsilon \in (0, 1/2)\) and \(\pi _0 \in (0, 1)\), the PARNI sampler satisfies diminishing adaptation; that is, its transition kernel satisfies

$$\begin{aligned} \sup _{\gamma \in \Gamma } \left\| P^{\text {PARNI}(*)-\bullet }_{\theta ^{(i+1)}}(\gamma , \cdot ) - P^{\text {PARNI}(*)-\bullet }_{\theta ^{(i)}}(\gamma , \cdot ) \right\| _{TV} \le C i^{-\lambda } \end{aligned}$$
(51)

for some constant \(C < \infty \).

Theorem 2

(Ergodicity and SLLN) Consider a target distribution \(\pi (\gamma )\) as in (37), an adaptation rate constant \(\lambda \in (1/2,1)\) for \(\bullet = \text {RM}\) or \(\lambda = 1/2\) for \(\bullet = \text {KW}\) and \(\epsilon \in (0,1/2)\), which lead to an adaptation rate of \(\mathcal {O}(i^{-\lambda })\), and the parameter \(\pi _0 > 0\) in Algorithm 3. Then ergodicity (49) and a strong law of large numbers (50) hold for the \(\text {PARNI(T)-KW}\), \(\text {PARNI(T)-RM}\), \(\text {PARNI(B)-KW}\) and \(\text {PARNI(B)-RM}\) samplers as described in Algorithm 3 and their corresponding multiple chain acceleration versions.

Fig. 2 Simulated data: trace plots of log posterior model probability from the Add-Delete-Swap (ADS), Adaptively Scaled Individual (ASI) adaptation, Pointwise implementation of Adaptive Random Neighbourhood Informed and Thresholded proposal with Kiefer–Wolfowitz update (PARNIT-KW), Pointwise implementation of Adaptive Random Neighbourhood Informed and Thresholded proposal with Robbins-Monro update (PARNIT-RM), Pointwise implementation of Adaptive Random Neighbourhood Informed and balanced proposal with Kiefer–Wolfowitz update (PARNIB-KW) and Pointwise implementation of Adaptive Random Neighbourhood Informed and balanced proposal with Robbins-Monro update (PARNIB-RM) samplers for the first 1500 iterations on simulated datasets with a signal-to-noise ratio of 2

Table 1 Simulated data: relative average mean squared errors for the Adaptively Scaled Individual (ASI), Pointwise implementation of Adaptive Random Neighbourhood Informed and Thresholded proposal with Kiefer–Wolfowitz update (PARNIT-KW), Pointwise implementation of Adaptive Random Neighbourhood Informed and Thresholded proposal with Robbins-Monro update (PARNIT-RM), Pointwise implementation of Adaptive Random Neighbourhood Informed and balanced proposal with Kiefer–Wolfowitz update (PARNIB-KW) and Pointwise implementation of Adaptive Random Neighbourhood Informed and balanced proposal with Robbins-Monro update (PARNIB-RM) schemes on estimating posterior inclusion probabilities over important and unimportant variables respectively, against a standard Add-Delete-Swap algorithm

5 Numerical studies

5.1 Simulated data

We consider the data generation model introduced by Yang et al. (2016) and replicated in the simulation studies of Griffin et al. (2021) and Zanella and Roberts (2019). For a linear model with n observations and p covariates, data are generated from the model specification

$$\begin{aligned} y = X^* \beta ^* + \epsilon \end{aligned}$$

where \(\epsilon \sim N_n(0, \sigma ^2 I_n)\) for a pre-specified residual variance \(\sigma ^2\) and \(\beta ^* = \text {SNR}\times \tilde{\beta }\sqrt{(\sigma ^2\log p)/n}\), in which \(\text {SNR}\) denotes the signal-to-noise ratio. We set \(\tilde{\beta } = (2, -3, 2,2,-3,3,-2,3,-2,3,0,\ldots ,0)\) and draw each row \(X^*_i\) of the design matrix from a multivariate normal distribution with mean zero and covariance \(\Sigma \) with entries \(\Sigma _{j j} = 1\) for all j and \(\Sigma _{ij} = 0.6^{|i-j|}\) for \(i \ne j\). We consider four choices of SNR, namely 0.5, 1, 2 and 3, two choices of n, namely 500 and 1,000, and three choices of p, namely 500, 5000 and 50,000.
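This data-generating mechanism is straightforward to reproduce. A sketch using numpy is given below; the function name and the use of a dense Cholesky factor are our own choices (for very large p one would instead exploit the autoregressive structure of \(\Sigma \)).

```python
import numpy as np

def simulate_yang_data(n, p, snr, sigma2=1.0, rho=0.6, seed=0):
    """Simulate data from the Yang et al. (2016) design used in Sect. 5.1.

    The first ten entries of beta-tilde are (2,-3,2,2,-3,3,-2,3,-2,3) and the
    remaining p - 10 are zero; rows of X are N(0, Sigma) with Sigma_ij = rho^|i-j|.
    """
    rng = np.random.default_rng(seed)
    beta_tilde = np.zeros(p)
    beta_tilde[:10] = [2, -3, 2, 2, -3, 3, -2, 3, -2, 3]
    beta_star = snr * beta_tilde * np.sqrt(sigma2 * np.log(p) / n)
    # Covariance with entries rho^|i-j|; its Cholesky factor is used to draw X.
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.standard_normal((n, p)) @ np.linalg.cholesky(Sigma).T
    y = X @ beta_star + rng.normal(scale=np.sqrt(sigma2), size=n)
    return y, X

# e.g. y, X = simulate_yang_data(n=500, p=500, snr=2)
```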

We use the same prior parameter values \(V_\gamma = I_{p_\gamma }\), \(g = 9\) and \(h = 10/p\) as specified in Griffin et al. (2021). In the same work, a detailed description of the resulting posterior distributions is given. In the presence of a low SNR (\(\text {SNR} = 0.5\)), there is too much noise to detect the true non-zero variables and the resulting posterior is rather flat, with no variables having posterior inclusion probabilities larger than 0.1. The posterior distributions are completely different when the SNR is large (\(\text {SNR} = 2\) and \(\text {SNR} = 3\)). In these cases all of the true non-zero variables have inclusion probabilities close to 1 as the posterior distributions are more concentrated. In the intermediate case \(\text {SNR} = 1\) slightly less than half of the true non-zero variables have inclusion probabilities above 0.8. In general the problem of finding the true non-zero variables becomes more difficult in the cases with lower SNR, smaller n and larger p.

We are interested in comparing the performance of the ASI and PARNI schemes relative to an ADS sampler because the ASI scheme has already been compared with several other state-of-the-art MCMC algorithms in Griffin et al. (2021). The adaptive algorithms are run with 25 multiple chains. The first third of each chain is discarded as burn-in. In addition, to reduce the computational cost, all adaptation terminates after the burn-in period.

Trace plots of the chains are a straightforward way to visualise convergence. Figure 2 shows trace plots of posterior model probabilities from the ADS, ASI, PARNIT-KW, PARNIT-RM, PARNIB-KW and PARNIB-RM algorithms for the first 1500 iterations when SNR = 2. The ADS scheme fails to converge for all choices of n and p, and in particular becomes trapped in areas around the null model (i.e. the empty model) for a long period of time when \(p=50,000\). The ASI scheme converges reasonably quickly when p is 500 or 5000, but takes longer to reach high probability regions when \(p = 50,000\). This suggests that ASI mixes and converges more slowly on high-dimensional data-sets. By contrast, all the PARNI samplers mix rapidly in this setting, taking only a few moves to reach high probability regions.

The trace plots are not a fully fair comparison as they do not take running time into account. To better address computational efficiency we ran all of the algorithms for 3 repetitions. Each individual chain was run for 15 min and we stored the estimates of posterior inclusion probabilities. We calculated mean squared errors of these estimates relative to “gold standard” estimates obtained from a weighted tempered Gibbs sampler that was run for roughly 12 h. We show results in the form of performance relative to the ADS scheme in Table 1. Smaller values always indicate better performance of the scheme. A value of \(-1\) indicates that the scheme yields mean squared errors 10 times smaller than those from the ADS scheme on that specific data-set. Generally speaking, the mean squared errors for important variables are greater than those for unimportant variables for almost every data-set and scheme. The choice of n does not significantly affect the performance of the samplers. Concentrating on the results for important variables, the ASI scheme leads to an order of magnitude improvement in efficiency over the ADS sampler, which matches the results in Griffin et al. (2021). The four PARNI algorithms with different weighting functions and adaptations lead to similar levels of accuracy and dominate both the ASI and ADS schemes in every case except \(p=500\). In particular, the PARNI schemes result in roughly \(10^5\)-fold improvements over ADS and more than 10-fold improvements over ASI when \(p=50,000\) and SNR\( = 2\). On the other hand, the ADS scheme is quite adept at removing the unimportant variables when the true model size is small compared to the number of covariates. When \(p = 50,000\) and SNR\(>1\) the ASI scheme struggles with unimportant variables and leads to worse estimates than ADS, but the PARNI algorithms produce better estimates even for these unimportant variables. Overall, the results suggest that the PARNI samplers are more computationally efficient than the alternatives when p is large. More results from simulated data are provided in Section C.3 of the supplementary material.
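Under our reading of Table 1, the reported quantity is the base-10 logarithm of a scheme's average mean squared error (against the gold-standard inclusion probabilities) relative to that of ADS, so that a value of \(-1\) corresponds to a ten-fold reduction. A sketch of this metric, with the exact averaging over repetitions treated as an assumption:

```python
import numpy as np

def relative_avg_mse(pip_scheme, pip_ads, pip_gold):
    """Relative average MSE on the log10 scale, as we read Table 1.

    pip_scheme and pip_ads are arrays of shape (repetitions, variables) holding
    estimated posterior inclusion probabilities; pip_gold has shape (variables,)
    and is broadcast across repetitions. A value of -1 means the scheme's
    average MSE is 10 times smaller than that of Add-Delete-Swap.
    """
    mse_scheme = np.mean((pip_scheme - pip_gold) ** 2)
    mse_ads = np.mean((pip_ads - pip_gold) ** 2)
    return np.log10(mse_scheme / mse_ads)
```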

Table 2 Prior specifications for the 8 real data-sets
Fig. 3 Real data: plots of expected squared jumping distance and average mean squared error against average acceptance rate and \(\omega \) for the Pointwise implementation of Adaptive Random Neighbourhood Informed and Thresholded proposal with Robbins-Monro update (PARNIT-RM). a average acceptance rate against \(\omega \) for 4 small-p real datasets; b average acceptance rate against \(\omega \) for 4 large-p real datasets; c expected squared jumping distance against average acceptance rate for 4 small-p real datasets; d expected squared jumping distance against average acceptance rate for 4 large-p real datasets; e average mean squared error against average acceptance rate for 4 small-p real datasets; f average mean squared error against average acceptance rate for 4 large-p real datasets

Fig. 4 Real data: plots of expected squared jumping distance and average mean squared error against average acceptance rate and \(\omega \) for the Pointwise implementation of Adaptive Random Neighbourhood Informed and Balanced proposal with Robbins-Monro update (PARNIB-RM). a average acceptance rate against \(\omega \) for 4 small-p real datasets; b average acceptance rate against \(\omega \) for 4 large-p real datasets; c expected squared jumping distance against average acceptance rate for 4 small-p real datasets; d expected squared jumping distance against average acceptance rate for 4 large-p real datasets; e average mean squared error against average acceptance rate for 4 small-p real datasets; f average mean squared error against average acceptance rate for 4 large-p real datasets

Fig. 5 Real data: trace plots of log posterior model probability from the Add-Delete-Swap (ADS), Adaptively Scaled Individual (ASI) adaptation, Pointwise implementation of Adaptive Random Neighbourhood Informed and Thresholded proposal with Kiefer–Wolfowitz update (PARNIT-KW), Pointwise implementation of Adaptive Random Neighbourhood Informed and Thresholded proposal with Robbins-Monro update (PARNIT-RM), Pointwise implementation of Adaptive Random Neighbourhood Informed and balanced proposal with Kiefer–Wolfowitz update (PARNIB-KW) and Pointwise implementation of Adaptive Random Neighbourhood Informed and balanced proposal with Robbins-Monro update (PARNIB-RM) samplers for the first 1500 iterations on 4 moderate-p datasets

Fig. 6 Real data: trace plots of log posterior model probability from the Add-Delete-Swap (ADS), Adaptively Scaled Individual (ASI) adaptation, Pointwise implementation of Adaptive Random Neighbourhood Informed and Thresholded proposal with Kiefer–Wolfowitz update (PARNIT-KW), Pointwise implementation of Adaptive Random Neighbourhood Informed and Thresholded proposal with Robbins-Monro update (PARNIT-RM), Pointwise implementation of Adaptive Random Neighbourhood Informed and balanced proposal with Kiefer–Wolfowitz update (PARNIB-KW) and Pointwise implementation of Adaptive Random Neighbourhood Informed and balanced proposal with Robbins-Monro update (PARNIB-RM) samplers for the first 1500 iterations on 4 large-p real datasets

5.2 Real data

We consider eight real data-sets used in Griffin et al. (2021): four with moderate p and four with larger p.

The first data-set is the Tecator data-set, previously analysed by Brown and Griffin (2010) in the context of Bayesian linear regression and used by Lamnisos et al. (2013) and Griffin et al. (2021) in the context of Bayesian variable selection. It contains 172 observations and 100 explanatory variables. We also consider three small-p data sets constructed by Schäfer and Chopin (2013) to illustrate the performance of sequential Monte Carlo algorithms on Bayesian variable selection problems: the Boston Housing data \((n = 506, p=104)\), the Concrete data \((n = 1030, p = 79)\) and the Protein data \((n = 96, p = 88)\). These data sets are augmented with squared and interaction terms, which leads to strong dependence and multicollinearity among the covariates.

The last four data sets are high-dimensional problems with very large p. Three of them come from an experiment conducted by Lan et al. (2006) to examine the genetics of two inbred mouse populations. The experiment resulted in a data-set of 60 observations in total, used to monitor the expression levels of 22,575 genes in 31 female and 29 male mice. Bondell and Reich (2012) first considered this data-set in the context of variable selection. Three physiological phenotypes are also measured by quantitative real-time polymerase chain reaction (PCR); they are used as possible responses and are named \(\text {PCR}i\) for \(i=1,2,3\) respectively. For more details, see Lan et al. (2006) and Bondell and Reich (2012). The last data-set concerns genome-wide mapping of a complex trait. The data are described in Carbonetto et al. (2017) and consist of body and testis weight measurements recorded for 993 outbred mice, together with genotypes at 79,748 single nucleotide polymorphisms (SNPs) for the same mice. The main purpose of the study is to identify genetic variants contributing to variation in testis weight. We therefore take the testis weight as the response, include body weight as a regressor that is always in the model, and perform variable selection on the 79,748 SNPs.

Before analysing the performance of MCMC algorithms on the above data-sets, it is worth discussing the selection of an optimal acceptance rate for the PARNI-RM sampler. The optimal scaling of a Gaussian random walk proposal on some specific forms of target distribution is a well-studied problem. The most commonly used guideline is to seek an average acceptance rate of 0.234 (Gelman et al. 1997). The optimal acceptance rates for sophisticated informed proposals involving gradient information are typically larger, e.g. 0.57 for the Metropolis-adjusted Langevin algorithm (Grenander and Miller 1994; Roberts and Rosenthal 1998) and 0.65 for Hamiltonian Monte Carlo (Duane et al. 1987; Beskos et al. 2013). As our balanced random neighbourhood proposals can be viewed as a discrete analogue of these gradient-based algorithms, it is natural to expect that the PARNI samplers will have a larger optimal acceptance rate than a random walk Metropolis. To test this, we ran the PARNIT-RM and PARNIB-RM schemes targeting different acceptance rates on the above data-sets. Figures 3 and 4 show the effect of the average acceptance rate on the expected squared jumping distance and the average mean squared errors of these two schemes respectively. Both figures lead to the same conclusions. Parts (a) and (b) illustrate the relation between the thinning parameter \(\omega \) and the average acceptance rate: larger values of \(\omega \) correspond to larger jumps and therefore lead to a smaller average acceptance rate. Parts (c) and (d) suggest that the maximum average squared jumping distance occurs when the acceptance rate is around 0.65 for all data-sets. Parts (e) and (f) show that the average mean squared error is minimised when the average acceptance rate is around a similar value. Therefore, for the problems we have examined, targeting an average acceptance rate of 0.65 is a sensible default. Similar results for the simulated data-sets of Sect. 5.1 are presented in Section C.1 of the supplementary material. We stress that the PARNIT-KW and PARNIB-KW schemes do not require a target acceptance rate to be chosen, so users who are uncomfortable with choosing this quantity for a particular data-set are recommended to use those versions of the sampler.
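The average squared jumping distance used in Figs. 3 and 4, and as the Kiefer–Wolfowitz objective, can be estimated from the proposed models and their acceptance probabilities. A sketch under our assumptions about how the estimate is formed (acceptance-probability-weighted jumps, with the squared Euclidean distance between 0/1 model vectors, which equals their Hamming distance):

```python
import numpy as np

def average_squared_jumping_distance(currents, proposals, accept_probs):
    """Rao-Blackwellised estimate of the expected squared jumping distance.

    Each proposed jump is weighted by its acceptance probability; for 0/1 model
    vectors the squared Euclidean distance between the current state and the
    proposal equals their Hamming distance.
    """
    jumps = [
        a * float(np.sum(np.asarray(g) != np.asarray(g_new)))
        for g, g_new, a in zip(currents, proposals, accept_probs)
    ]
    return float(np.mean(jumps))
```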

We consider a total of ten different MCMC schemes for these data-sets. In addition to the six schemes used in the simulation study (ADS, ASI, PARNIT-KW, PARNIT-RM, PARNIB-KW and PARNIB-RM), we also implement four state-of-the-art algorithms: the Hamming ball sampler (HBS) with radius 1 of Titsias and Yau (2017), the tempered Gibbs sampler (TGS) and weighted tempered Gibbs sampler (WTGS) of Zanella and Roberts (2019), and the Locally Informed and Thresholded (LIT) scheme of Zhou et al. (2021) (which uses the same weighting function as the LIT-MH-1 scheme in their paper). All algorithms are run for the same amount of time and compared using average mean squared errors. Only the adaptive schemes are run with 25 multiple shorter chains; the other schemes use a single longer chain. The prior specification for each data-set is given in Table 2.

Figures 5 and 6 show trace plots of posterior model probabilities from the ADS, ASI, PARNIT-KW, PARNIT-RM, PARNIB-KW and PARNIB-RM algorithms for the first 1500 iterations on all eight real data-sets. Overall, the PARNI algorithms perform better than the ADS and ASI schemes in both convergence and mixing. It is clear that the ADS scheme does not mix well since it struggles to explore the model space. All algorithms reach high-probability regions for the data-sets with moderate p in roughly the same number of iterations; however, the PARNI schemes reach these high-probability regions faster and accept more jumps within the model space. On the large-p data-sets, the algorithms behave differently. The ADS scheme gets trapped at the null model and only proposes models around it, and the ASI algorithm does not converge properly within the first 1500 iterations either. The PARNI schemes, by contrast, accept almost every proposed state and mix very quickly. They are able to propose and accept models with relatively low posterior probabilities and explore the sample space efficiently.

We next turn attention to the average mean squared errors on these eight real data-sets. The results are shown in Table 3. On the moderate-p data-sets, the PARNI samplers do not dominate the other schemes, but they still lead to good results. However, PARNI performs worst on the Boston Housing and Concrete data-sets, which are multi-modal and contain intricately correlated covariates. This suggests that the point-wise sub-proposals of PARNI can become trapped in isolated local modes. The ADS scheme performs well in terms of computational efficiency for the Tecator and Concrete data-sets due to a convenient computational implementation which has the cheapest computational cost among the competing schemes. Due to its dimension-free mixing property, the LIT scheme outperforms ADS except on the Tecator data-set, where all covariates carry non-negligible weights and all covariates are therefore in the potentially influential subset S of \(\Gamma \) (see Sect. 2.3 of Zhou et al. (2021) for more detail). For large-p problems all the PARNI schemes significantly outperform the other samplers. Surprisingly, the HBS and TGS schemes lead to worse estimates than ADS. This can be explained by the computational cost per iteration of the HBS, TGS and WTGS algorithms, which is linear in p. The combination of these large computational costs and the issue of rarely exploring important variables leads to low efficiency for HBS and TGS. The WTGS algorithm still outperforms TGS, which coincides with the conclusions in Zanella and Roberts (2019), where the WTGS algorithm is shown to have a smaller relaxation time than TGS. The ASI algorithm gives estimates competitive with WTGS in high dimensions but is eventually dominated by the PARNI schemes. The LIT scheme leads to better results on the SNP data-set (\(n = 993\)) but not on the PCR data-sets (\(n=60\)), since the dimension-free mixing of LIT only holds when n is comparatively large. It also yields larger average mean squared errors than the PARNI samplers on all large-p data-sets because the ADS-type neighbourhoods of the LIT scheme only contain models within a Hamming distance of at most 2, so the jumping distance of LIT is bounded by 2, whereas PARNI can potentially propose larger jumps. Among the PARNI schemes, the two weighting schemes (thresholding and balancing function) have a similar level of efficiency. Specifically, the thresholding function estimates the posterior inclusion probabilities with lower relative average mean squared errors than the balancing function on all four moderate-p data-sets and the PCR1 data-set. The performance of all PARNI schemes is similar on the SNP data-set and outperforms that of the competitors. In terms of the adaptation schemes, the PARNI sampler with Kiefer–Wolfowitz adaptation generally performs better than the Robbins-Monro version, but only by a small margin. This is because the optimal acceptance rate is problem-specific and not exactly 0.65 for every data-set.

Table 3 Real data: relative average mean squared errors for the Adaptively Scaled Individual (ASI), Hamming Ball Sampler (HBS), Tempered Gibbs Sampler (TGS), Weighted Tempered Gibbs Sampler (WTGS), Locally Informed and Thresholded (LIT), Pointwise implementation of Adaptive Random Neighbourhood Informed and Thresholded proposal with Kiefer–Wolfowitz update (PARNIT-KW), Pointwise implementation of Adaptive Random Neighbourhood Informed and Thresholded proposal with Robbins-Monro update (PARNIT-RM), Pointwise implementation of Adaptive Random Neighbourhood Informed and balanced proposal with Kiefer–Wolfowitz update (PARNIB-KW) and Pointwise implementation of Adaptive Random Neighbourhood Informed and balanced proposal with Robbins-Monro update (PARNIB-RM) schemes on estimating posterior inclusion probabilities over important variables against a standard Add-Delete-Swap algorithm

6 Discussion and future work

In this paper we present a framework for neighbourhood-based MCMC algorithms and propose a new scheme as an informed counterpart to the ASI algorithm in Griffin et al. (2021), using elements from the locally informed Metropolis-Hastings proposals introduced in Zanella (2020) and Zhou et al. (2021). To address the expensive computational costs introduced by the informed proposal, we introduce two less computationally costly algorithms, the PARNI schemes, which can lead to a dramatic improvement in computational efficiency. In addition, we offer two choices of informed weighting function, the thresholding function and the balancing function. The PARNI schemes also allow two different adaptation schemes, the Kiefer–Wolfowitz and Robbins-Monro schemes. The numerical results in Sect. 5 support the power of the algorithmic structure of PARNI. The success of these new schemes is attributed to two aspects: firstly, the adaptation helps to explore the areas of interest (mainly those with high posterior probabilities), and secondly, the locally informed proposals are able to suppress random walk behaviour in high dimensions and lead to rapidly mixing samplers in practice. Based on the numerical studies on both simulated and real data-sets, we recommend using a PARNI sampler with the Kiefer–Wolfowitz scheme for tackling high-dimensional (large-p) Bayesian variable selection problems. We note that it can still be challenging for the PARNI samplers to move across low-probability regions, which could affect performance when the posterior has very isolated modes. This is because the PARNI samplers propose models sequentially and each sub-proposal can alter at most one position. On the other hand, the original ARNI scheme can take larger jumps and is better able to explore well-separated modes, albeit with a substantial increase in computational cost. In summary, new schemes like PARNI show the potential of combining adaptive, random neighbourhood and informed proposals. We look forward to adding more theoretical support to the numerical evidence shown here in future work. The code to run the PARNI samplers and the aforementioned numerical studies can be downloaded from https://github.com/XitongLiang/The-PARNI-scheme.git.

There are many directions for extensions and future work. Some recent work has shed light on the extra computational costs that come with informed proposals. Grathwohl et al. (2021) develop an accelerated locally informed proposal that uses gradients of the log target mass function. It is possible to derive the gradient of the posterior mass function with respect to \(\gamma \) with minor modifications to the representation of the posterior distribution \(\pi (\gamma )\). To address the lack of mode jumping in the PARNI schemes, we can first try to construct larger blocks intelligently so that separated models are covered in a single block. This can be achieved by introducing basis vectors beyond the Cartesian case in the block construction. One could also use the sequential Monte Carlo methods of Schäfer and Chopin (2013) and Ma (2015), which are better able to handle multimodality. Combining them with PARNI offers the chance of producing efficient methods for highly multimodal posterior distributions with well-separated modes. Another option in this direction is the JAMS algorithm of Pompe et al. (2020), which first locates each individual mode and then produces a mixture proposal that involves jumps within and between modes.

We also intend to study the performance of the PARNI schemes in generalised linear models, as in Wan and Griffin (2021), or in more flexible Bayesian variable selection models such as that suggested by Rossell and Rubio (2018). In these cases, the regression coefficients and residual variance can no longer be integrated out analytically and the likelihood of \(\gamma \) is not available in closed form. Informed proposals for such models are computationally challenging because the proposals involve evaluations of this likelihood, and the required approximations and estimates of the marginal likelihood are computationally intensive. One possible approach is the data-augmentation method using the Pólya-gamma distribution described in Polson et al. (2013). The design does, however, require some care to avoid the inefficiency caused by introducing a large number of auxiliary variables in large-n problems. We also believe that random neighbourhood samplers can be used beyond variable selection, and we aim to consider applications to other discrete-valued sampling problems in future work.