Adaptive random neighbourhood informed Markov chain Monte Carlo for high-dimensional Bayesian variable selection

We introduce a framework for efficient Markov chain Monte Carlo algorithms targeting discrete-valued high-dimensional distributions, such as posterior distributions in Bayesian variable selection problems. We show that many recently introduced algorithms, such as the locally informed sampler of Zanella (J Am Stat Assoc 115(530):852–865, 2020), the locally informed with thresholded proposal of Zhou et al. (Dimension-free mixing for high-dimensional Bayesian variable selection, 2021) and the adaptively scaled individual adaptation sampler of Griffin et al. (Biometrika 108(1):53–69, 2021), can be viewed as particular cases within the framework. We then describe a novel algorithm, the adaptive random neighbourhood informed sampler, which combines ideas from these existing approaches. We show using several examples of both real and simulated data-sets that a computationally efficient point-wise implementation (PARNI) provides more reliable inferences on a range of variable selection problems, particularly in the very large p setting.


Introduction
Despite their long history, linear regression models remain a key building block of many present-day statistical analyses. In the modern setting, practitioners not only show interest in making good predictions but also intend to investigate underlying low-dimensional structure based on the belief that only a small subset of predictors play a crucial role in predicting the response. These problems can be addressed by variable selection. A variable selection method is an automatic procedure that selects the best (small) subset of covariates that explains most of the variation in the response (Chipman et al., 2001). Frequentist approaches focus on model comparisons through information criteria or point estimates, using e.g. maximum penalised likelihood under sparsity assumptions (Hastie et al., 2015). Alternatively the Bayesian approach can be taken by imposing an appropriate prior on all possible models and computing the posterior.
We consider Bayesian variable selection (BVS) with spike-and-slab priors (Mitchell and Beauchamp, 1988), which lead to natural uncertainty measures such as posterior model probabilities and marginal posterior variable inclusion probabilities. For a linear regression model with p candidate covariates, it has been shown that spike-and-slab priors often lead to posterior consistency in the sense that the posterior collapses to a Dirac measure on the true model as more observations are gathered (Fernandez et al., 2001;Liang et al., 2008;Yang et al., 2016), even in high-dimensional setting where p grows with n (Shang and Clayton, 2011;Narisetty and He, 2014). Another approach is to employ continuous shrinkage priors (e.g. Polson and Scott, 2010;Griffin and Brown, 2021), which only give posterior inference on regression coefficients but can result in a more computationally tractable posterior distribution.
The exact posterior distribution when using a spike-and-slab prior is challenging to compute, and when p > 30 Markov chain Monte Carlo (MCMC) algorithms are typically used to estimate posterior summaries of interest (George and McCulloch, 1993;Chipman et al., 2001). Garcia-Donato and Martinez-Beneito (2013) discuss the properties of classical methods such as the Gibbs sampler in the context of variable selection. Other MCMC algorithms such as MC 3 (Madigan et al., 1995) and the Add-Delete-Swap method (Brown et al., 1998) are also well-studied. Fernandez et al. (2001), for instance, use the MC 3 sampler to explore the benchmark hyperparameter in a g-prior. Yang et al. (2016) use Add-Delete-Swap to provide the conditions for rapid mixing in the sense that the mixing time grows at most polynomially in p. These approaches do, however, suffer from an unexpectedly long mixing time and therefore slow convergence when p is large. For this reason, alternative approaches have also been developed. Hans et al. (2007) describe a novel Shotgun Stochastic Search (SSS) algorithm to identify a subset of probable models and then approximate the posterior distribution by restricting calculations to this subset. Papaspiliopoulos and Rossell (2017) propose an efficient deterministic iterative procedure of model search under block-diagonal design. Ray and Szabó (2021) introduce a mean-field spike and slab variational Bayes approximation to the BVS posterior, and show that such an approximation converges to the sparse truth at the optimal rate and gives optimal prediction of the response vector under mild conditions. MCMC methods for variable selection can be improved using adaptive proposals within a Metropolis-Hastings algorithm (Metropolis et al., 1953;Hastings, 1970). Adaptive MCMC is a sub-class of algorithms in which tuning parameters are automatically updated "on the fly" (e.g. Andrieu and Thoms, 2008). Several adaptive methods have been developed in the context of BVS. Ji and Schmidler (2013) suggest using a multivariate Gaussian proposal kernel for the regression coefficient in which the mean and covariance matrix are learned as the algorithm runs. Lamnisos et al. (2009) and Lamnisos et al. (2013) develop adaptive Gibbs, MC 3 and Add-Delete-Swap samplers. More recently  developed the Adaptively-Scaled Individual Adaptation sampler (ASI), which is able to adapt the importance of each individual predictor variable and propose multiple swaps per iteration in high-dimensional settings. The algorithm works by learning the posterior inclusion probabilities for each covariate adaptively in a Rao-Blackwellised manner, and using these probabilities to construct a Metropolis-Hasting proposal under the assumption of an orthogonal design matrix.
Recently informed MCMC schemes have gained popularity for problems with discrete parameter spaces (having already achieved prominence in the continuous setting). Informed MCMC schemes are those in which the Metropolis-Hastings proposal exploits some information about the posterior distribution. Intuitively, the success of informed proposals relies on avoiding models with low posterior model probabilities (Zhou et al., 2021). Titsias and Yau (2017) propose the Hamming ball sampler (HBS) in which models are proposed in proportion to their locally-truncated posterior probability within a Hamming ball neighbourhood. Zanella and Roberts (2019) consider a Tempered Gibbs sampler (TGS), which involves importance sampling and more frequently updates components with lower conditional distributions. A more general class of locally informed proposals is introduced by Zanella (2020). These locally informed proposals can be obtained by weighting a base kernel using a balancing function, which is a function of the posterior distribution that satisfies a certain functional property. The base kernel is typically concentrated on a neighbourhood of the current state, resulting in a proposal that is informed or balanced using "local" information about the posterior. The author shows that a random walk proposal is asymptotically dominated by its locally balanced counterpart in the Peskun sense as dimensionality increases (Peskun, 1973;Tierney, 1998). It has also been shown by Zhou et al. (2021) that a dimension-free mixing time bound can be constructed for a properly-designed locally informed MCMC algorithm for BVS. For other developments concerning locally informed proposals, see e.g. Livingstone and Zanella (2019); Gagnon (2021); Power and Goldman (2019).
Informed MCMC schemes often mix quickly and have good convergence properties, but the computation of each transition can be prohibitively expensive. For example, in Bayesian variable selection, informed proposals usually involve calculating multiple posterior model probabilities within a neighbourhood of the current model (such as all models within Hamming distance one). The size of this neighbourhood will be at least linear in p and will tend to include large numbers of unimportant variables under standard sparsity assumptions. In this paper we propose a solution to this problem by introducing a framework for constructing flexible and efficient MCMC algorithms based on locally balanced proposals within random neighbourhoods. In large p settings, this randomness can be exploited to control the size and the number of unimportant variables in generated neighbourhood. This leads to Markov chains with good convergence properties and controlled computational cost per iteration. To illustrate the power of the framework we develop a new MCMC algorithm for Bayesian variable selection in linear regression, namely the Point-wise Adaptive Random Neighbourhood Informed (PARNI) sampler. We illustrate with extensive empirical results on both real and simulated datasets that the PARNI sampler yields good estimates for posterior quantities of interest and performs particularly well for well-known large-p examples such as the PCR (p = 22, 575) and SNP (p = 79, 748) datasets. Random neighbourhoods have also been considered previously in the context of stochastic search (Chen et al., 2016).
The rest of this paper is structured as follows. In Section 2, we review BVS for the linear model along with prior specification. We also briefly describe both the ASI scheme of  and the locally informed method of Zanella (2020). In Section 3, we characterise a construction of random neighbourhood proposals and illustrate that locally informed proposal and the ASI scheme fall in this framework. Section 4 presents the construction of adaptive random neighbourhood and informed samplers. Following this structure, we present the ARNI and PARNI samplers. In addition, we establish both the ergodicity and a strong law of large numbers for the PARNI algorithm. We implement the PARNI sampler in Section 5 on both simulated and real data. Comparisons between the PARNI sampler and other state-of-the-art MCMC algorithms are carried out to showcase its capacity and efficiency. In Section 6 we discuss limitations and possible future work. Detailed explanations and proofs are provided in the supplement.

Bayesian variable selection for the linear regression model
Consider a dataset {(yi, xi1, ..., xip)} n i=1 , where the vector y = (y1, ..., yn) ∈ R n is called the response variable and each xj = (x1j, ..., xnj) is one of p predictor variables or covariates. The variable selection problem is concerned with finding the best q p covariates that are most associated with the response. Assuming that each regression includes an intercept, then there are 2 p possible models that can be formulated to predict the response. We refer to each model as Mγ where the models are indexed by the indicator variable γ = (γ1, . . . , γp) ∈ Γ = {0, 1} p , where γj = 1 if the j-th variable is included in model Mγ and γj = 0 otherwise. We refer to Γ as model space and let pγ := j γj. The model Mγ associated with γ is then where ∼ Nn(0, σ 2 In), y is an n-dimensional response vector, Xγ is an (n × pγ) design matrix which consists of the "active" variables in γ (those for which γj = 1), α is an intercept term and βγ ∈ R pγ . In the Bayesian framework, we consider a commonly-used conjugate prior specification For simplicity, we can remove the intercept term α by centering y and Xj for all j. Chipman et al. (2001) highlight that this method can be motivated from a formal Bayesian perspective by integrating out the coefficients corresponding to those fixed regressors with respect to an improper uniform prior. The covariance matrix Vγ is often chosen as (X T γ Xγ) −1 (a g-prior) or identity matrix Ip γ (an independent prior). For both of these choices, the marginal likelihood p(y|γ) is analytically tractable. Suitable values for the global scale parameter g are suggested in Fernandez et al. (2001). It can also be driven by a hyperprior, yielding a fully Bayesian model (see Liang et al. (2008) for details). The hyperparameter h ∈ (0, 1) is the prior probability that each variable is included in the model. Steel and Ley (2007) suggest against using fixed h unless strong information is given, and instead placing a hyperprior on it such as a Beta prior h ∼ Beta(a, b), leading to a Beta-binomial prior on the model size. In the following sections, we will develop efficient sampling schemes targeting the posterior distribution π(γ) ∝ p(y|γ)p(γ).  introduce a scalable adaptive MCMC algorithm targeting high-dimensional BVS posterior distributions together with a method that automatically updates the tuning parameters. They consider the class of proposal kernels

Adaptively Scaled Individual Adaptation algorithm
for i = 1 to i = N do for l = 1 to l = L do Sample γ l, ∼ q ζ (i) η (i) (γ l,(i) , ·) as in (2) and U ∼ U (0, 1); (3), then γ l,(i+1) = γ l, , else γ l,(i+1) = γ l,(i) ; end for Updateπ (i+1) j as in (5), setπ Algorithm 1 Adaptively Scaled Individual Adaptation (ASI) where η = (A, D) = (A1, . . . , Ap, D1, . . . , Dp), qη,j(γj = 0, γ j = 1) = Aj and qη,j(γj = 1, γ j = 0) = Dj, with Metropolis-Hastings acceptance probability This proposal mainly benefits from two aspects. Firstly, the flexibility offered by 2p tuning parameters allows the proposal to be tailored to the data. Secondly, this form of proposal also allows multiple variables to be added or deleted from the model in a single iteration, which in turn allows the algorithm to make large jumps in model space.  suggest an optimal choice of η = (A, D) in Peskun sense while assuming that all variables are independent. If πj denotes the posterior inclusion probability of the j-th regressor, the optimal choice of η opt = (A opt , D opt ) is given as The independence assumption is usually violated due to the correlation between regressors and therefore a scaled proposal with parameters η = ζη opt for a scaling parameter ζ ∈ (0, 1) is suggested. This scaling parameter ζ controls the number of variables that differ between the current state γ and the proposed state γ . Smaller values of η can be used to avoid overly ambitious moves with low probabilities of acceptance and so control the average acceptance rate. They also suggest multiple chain acceleration with common adaptive parameters since running multiple independent chains with shared adaptive parameters can facilitate the convergence of the adaptive parameters (Craiu et al., 2009). This phenomenon is guaranteed in their simulation studies where the schemes with 25 multiple chains outperform the schemes only with 5 multiple chains in terms of relative efficiency especially for large p datasets. Suppose L chains are used and let γ l,(i) and γ l, denote the current state and proposal respectively for the l-th chain. The tuning parameters of the proposal are updated on the fly using a Rao-Blackwellised estimate of the posterior inclusion probability of the j-th regressor which, at the N -th iteration, iŝ The use of the Rao-Blackwellised estimates of the posterior inclusion probabilities can swiftly distinguish unimportant variables.  show how this can be calculated in O(p) operations which leads to a scalable MCMC scheme in large-p BVS problems. At the i-th iteration, the proposal parameters are and the scaling parameter ζ (i) is tuned using the Robbins-Monro scheme for a target rate of acceptance τ and the mapping logit : ( , 1 − ) → R is a modified logistic function (or logit function) defined by for some small ∈ (0, 1/2). The full description of the sampler is given in Algorithm 1. The resulting algorithm is called Adaptively Scaled Individual Adaptation (ASI).  establish the π-ergodicity and a strong law of large numbers for the ASI sampler.

Locally informed proposals
In continuous sample space, MCMC algorithms often utilise gradients of the target distribution, e.g. the Metropolis-adjusted Langevin algorithm (Grenander and Miller, 1994) and Hamiltonian Monte Carlo (Duane et al., 1987). These methods are defined on continuous spaces but Zanella (2020) develop a class of locally informed proposals as an analog for discrete spaces. The approach assumes that we can define a random walk Metropolis proposal kernel Q on a neighbourhood N ⊂ Γ with mass function q. Zanella (2020) considers a class of pointwise informed proposals Qg defined as follows where g : [0, ∞) → [0, ∞) is a continuous function and Zg(γ) is a normalising constant such that The choice of the function g is crucial for the performance of Qg since it determines how the target distribution π drives the proposal. Locally balanced proposals are defined to be those proposals Qg which are approximately π-reversible if Q only proposes local moves. Zanella (2020) shows that this occurs if g(t) = tg(1/t) for all t > 0 and calls these balancing functions. The neighbourhoods are often chosen to be N = Hm(γ) := {γ ∈ Γ|dH (γ , γ) ≤ m} for which dH (·, ·) denotes the measure of Hamming distance (i.e. dH (γ, γ ) = p j=1 |γj −γ j |) and the proposal kernel Q would be a uniform distribution on the neighbourhood N . When m is taken to be 1, Q is identical to the MC 3 sampler.
Both analytical and empirical results suggest that a locally balanced proposal outperforms other alternatives such as the original non-informative kernel (when g(t) = 1) or the globally balanced kernel (when g(t) = t). Theorem 5 of Zanella (2020) shows that using a uniform based kernel on neighbourhood Hm(γ) will be asymptotically optimal, in terms of Peskun ordering, as the dimensionality goes to infinity under mild conditions.

Random neighbourhood samplers
We consider a framework for constructing Metropolis-Hastings proposals to sample from π(γ) in which a new state is proposed within a neighbourhood around the current state. The neighbourhoods can be random, and are generated using an auxiliary variable k as a neighbourhood indicator. The auxiliary variable k is a discrete random variable defined on a countable set K such that a neighbourhood N = N (γ, k) is generated as the same probability of generating k (i.e. p(N |γ) = p(k|γ)). Suppose γ is the current state and Q k is a Metropolis-Hastings proposal kernel (conditioned on k) with mass function q k . A new state γ is drawn from kernel Q k after a k is generated. In updating k at each iteration, we usually consider sending k to a new state k ∈ K depends on a bijection ρ : K → K which is an involution (i.e. a self-inverse function which satisfies ρ(ρ(k)) = k). We call an MCMC algorithm that uses the above construction to generate Metropolis-Hastings proposals a random neighbourhood sampler. The followings are some examples of random neighbourhood samplers.
Example 1 (Samplers with non-stochastic neighbourhoods). In fact, samplers with non-stochastic neighbourhoods are also random neighbourhood samplers where the specific neighbourhoods are generated with constant probability of 1 at each state γ. In such cases, the choices of k and ρ can be arbitrary. For instance, the MC 3 sampler can be viewed as a random neighbourhood sampler for which the neighbourhood N consists of models that are 1-Hamming distance from γ. In particular, the locally informed samplers of Zanella (2020) also belong to this class with neighbourhood N as defined in Section 2.3.
Example 2 (Add-Delete-Swap sampler). In each iteration of an Add-Delete-Swap sampler, a strategy from "addition", "deletion" and "swap" is uniformly chosen which implies that the auxiliary variable k is uniformed distributed over the sample space K = {"addition", "deletion", "swap"} and therefore construct a neighbourhood N (γ, k) as in Yang et al. (2016). A new state γ is uniformly proposed from N (γ, k). The corresponding mapping ρ is then a function that sends the auxiliary variable to an opposite strategy, e.g. it sends "addition" to "deletion" and vice versa. Note that the opposite strategy of "swap" is itself.
Example 3 (Hamming ball sampler). A Hamming ball sampler with radius m is described by Titsias and Yau (2017). This algorithm is literally evolved from a neighbourhood construction Hm(γ) ⊂ Γ which produces a subset of states at most m-Hamming distance away from γ. The auxiliary variable k is equivalent to U in their design in which k is uniformly distributed over the set K = Hm(γ) and a neighbourhood N (γ, k) = Hm(k) is used to draw new state. The Hamming ball sampler proposes a new state according to the truncated posterior model probability in the neighbourhood N (γ, k). In this scheme, the mapping ρ is an identity function which indicates the same auxiliary variable is used in reversed moves.
Throughout this article, we refer to the above three stages as neighbourhood construction, withinneighbourhood proposal and accept/reject step respectively. To preserve the reversibility of the chain, it is better to design a neighbourhood generation scheme where the law holds for any γ, γ and k. Upon this law, we assume that the condition is satisfied. This assumption is a generalisation of the paired-move strategy in Chen et al. (2016) and it results in the correctness and reversibility of such a scheme through the following proposition.
Proposition 1. A random neighbourhood sampler is π-reversible provided that condition (13) holds, p(k|γ) is a valid probability measure on K and q k (γ, γ ) is a valid probability measure on neighbourhood N (γ, k) for all γ ∈ Γ and k ∈ K.
Remark 1. To generalise the framework of random neighbourhood samplers, it is possible to use a continuous auxiliary variable k. In such a case, the acceptance probability in (11) should include the Jacobian term. We show in the next part that the ASI sampler is also random neighbourhood sampler but it takes a strategy unlike the locally balanced proposals. Instead of paying more attention to the within-neighbourhood proposal but using a naive neighbourhood construction, the ASI sampler more focuses on constructing sophisticated random neighbourhoods which is more likely to contain promising models and employs a random walk within-neighbourhood proposal.

Another take on the ASI scheme
It is not straightforward to observe that the ASI sampler is a random neighbourhood sampler, however we show below that in fact it can be. To do so, we introduce a random neighbourhood sampler, the Adaptive Random Neighbourhood (ARN) sampler, and prove that the ARN and ASI samplers are equivalent if they share some common adaptive parameters. We also illustrate that the ARN sampler differs from the locally informed approach, instead it puts more efforts on neighbourhood construction but its within-neighbourhood proposal follows a random walk.
We consider a random neighbourhood sampler with algorithmic tuning parameter θ = (ξη opt , ω) ∈ ( , 1 − ) 2p+1 := ∆ 2p+1 , where η opt is given in (4), and the tuning parameters ξ and ω are used in the random neighbourhood construction and the within-neighbourhood proposal respectively. In the random neighbourhood construction, the neighbourhood indicator variable k = (k1, . . . , kp) ∈ K = {0, 1} p is generated from the distribution where p RN ξη opt ,j (kj = 1|γj = 0) = ξA opt j and p RN ξη opt ,j (kj = 1|γj = 1) = ξD opt j . This is equivalent to the ASI proposal in (2) where kj = 1 if and only if γj = γ j . A neighbourhood N (γ, k) is obtained from γ and k for which γ is the "centre" of N (γ, k) and k indicates the possible indices altered from γ. These tuning parameters ξ and η opt are abortively updated on the fly. For any γ * ∈ N (γ, k), kj = 0 implies that γ * j = γj. This identity can be used to state a formal definition of the neighbourhood N (γ, k) as The neighbourhood contains 2 p k models where p k is number of 1s in k (i.e. p k := p j=1 kj). The parameter ξ affects the p k and therefore controls neighbourhood size. So we call ξ the neighbourhood scaling parameter.
The mapping ρ is chosen to be an identity function. The within-neighbourhood proposal within this adaptive random neighbourhood scheme is also based on the same proposal in (2) over the neighbourhood N (γ, k). It can be characterised as choosing the variables to be added or deleted from the model by thinning from within the set {j| kj = 1} with the thinning probability set to be ω ∈ (0, 1). We refer to this parameter as the unique within-neighbourhood proposal tuning parameter. A larger value of ω increases the probability of proposing γ further away from γ in Hamming distance. This can be written formally as the proposal in (2) where q THIN ω,1 (γj, 1 − γj) = ω and q THIN ω,0 (γj, 1 − γj) = 0. The proposal q THIN ω,k is symmetric and only generates new states inside the neighbourhood N (γ, k). This is because the probabilities of proposing flips on coordinates other than j such that kj = 1 are 0. The scheme is completed by accepting or rejecting the proposal using a standard Metropolis-Hastings acceptance probability Remark 2. An alternative formulation to (15) in terms of Hamming distance between γ and γ is where dH (γ, γ ) is the measure of Hamming distance between two models γ and γ .

Sample k ∼ p RN
ξη opt (·|γ) as in (14); Sample γ ∼ q THIN ω,k (γ, ·) as in (15) and U ∼ U (0, 1); if U < α ARN θ,k (γ, γ ) as in (16), then accept γ Algorithm 2 Adaptive Random Neighbourhood sampler (ARN) Algorithm 2 describe how a new state γ is proposed using the ARN scheme. We indicate the transition kernel by p ARN θ and the corresponding sub-transition kernel conditional on k by p ARN θ,k . They obey the relationship The following proposition helps to show that the ARN sampler is π-reversible.
Proposition 1 together with Proposition 2 show that the ARN transition kernel is π-reversible and therefore generates samples that preserve the target distribution π. In fact ARN and ASI are mathematically equivalent provided that the tuning parameter choices are made in a prescribed manner. To see this suppose that the tuning parameters of both the ARN and ASI schemes are fixed and share the same tuning parameter η. The following theorem shows that their transition probabilities from γ to γ are equal when ζ = ξ × ω holds.
In addition we deduce the following corollary.
Corollary 1 shows that two ARN kernels with different tuning parameters coincide in probability if the products of the neighbourhood scaling parameter ξ and proposal thinning parameter ω are equal. This corollary also suggests that magnitudes of ξ and ω can shift to each other without modifying the resulting proposal as long as their product preserves.

Adaptive random neighbourhood and informed samplers
It should be clear from the above discussion that both the locally informed and ASI schemes can be viewed as random neighbourhood samplers, and that in the former much work is done to select proposals within a neighbourhood, while in the latter most of the work is done in constructing a neighbourhood where many of the candidate models represent proposals that are likely to be accepted by the algorithm. Our main methodological contribution is to design a random neighbourhood sampler for which both the neighbourhood construction and within-neighbourhood proposal are designed in an informed way. We therefore consider using an adaptive random neighbourhood approach to construct neighbourhoods, followed by a locally informed approach to select a proposal from within this neighbourhood.
The advantages of combining the two schemes in this manner are worth highlighting. A key strength of ASI is that generating proposals is computationally cheap, but when components of the posterior distribution are highly correlated then the assumption of independence that is embedded into the proposal generation can lead to overly ambitious moves that will be rejected. To combat this, the scaling parameter must be used to control the acceptance rate, but in the presence of high correlation this can lead to small moves and slow mixing. The locally informed sampler can cope well with high levels of correlation in the posterior distribution, but in high-dimensions often the (uninformed) neighbourhood will often either contain no sensible models, or be so large that the cost of computing all of the balancing function values within it becomes prohibitive. Combining the two schemes is therefore an attractive proposition, as an intelligent neighbourhood that is also not too large can be constructed using ASI, and then correlation can be controlled for at the second stage by choosing the within-neighbourhood proposal using the locally informed approach.
We give the details of this informed and adaptive random neighbourhood sampler below, which we call the Adaptive Random Neighbourhood Informed (ARNI) sampler. After this we define the point-wise ARNI (PARNI) scheme, which enjoys the benefits of ARNI but with much lower computational cost.

Adaptive Random Neighbourhood Informed algorithm
We first describe a construction of random neighbourhood informed proposals. Suppose a random neighbourhood sampler is given with neighbourhood indicator variable k ∈ K and a update mapping ρ together with a within-neighbourhood proposal kernel Q k . The variable k follows a conditional distribution p(k|γ) whereas the proposal Q k produces a new state γ within neighbourhood N (γ, k) in an uninformed manner.
We consider the class of informed proposals Q g,k with mass function where g : [0, ∞) → [0, ∞) is a continuous function, and Z g,k (γ) is a normalising constant defined by The generated new state γ is accepted using the Metropolis-Hastings rule The proposal collapses to the locally balanced proposal of Zanella (2020) when the neighbourhood is non-stochastic and the within-neighbourhood proposal is symmetric. According to the locally balanced proposals, using a balancing function g that satisfies g(t) = tg(1/t) is optimal in Peskun sense. In random neighbourhood informed proposals, we also recommend using a balancing function g.
In what follows, we combine the above random neighbourhood informed proposal with the ARN scheme and develop an Adaptive Random Neighbourhood Informed (ARNI) proposal that uses an informed proposal at the within-neighbourhood proposal stage. In the ARNI scheme, the within-neighbourhood proposal in Algorithm 2 is replaced by where g is a balancing function such that g(t) = tg(1/t) holds and θ = (ξη opt , ω) ∈ ∆ 2p+1 . The last equation follows since the within-neighbourhood proposal q THIN ω,k is symmetric and therefore q THIN ω,k (γ , γ)/q THIN ω,k (γ, γ ) = 1 for all γ ∈ N (γ, k). The Metropolis-Hastings acceptance probability is tailored to the new informed proposal as To boost the convergence of these adaptive tuning parameters, the same multiple chain strategy as ASI should be implemented. In addition to the notations used in ASI, k l,(i) denotes the neighbourhood indicator variable for the l-chain at time i. For L multiple chains, the tuning parameters η opt are updated following the same scheme of ASI as in (6) and (7). Two scaling parameters ξ and ω can be updated using the Robbins-Monro schemes logit where p k is the size of k as mentioned previously, s is the target size of k, α l i is the acceptance probability at the ith iteration for the l-th chain and τ is the target average acceptance rate. We recommend using the diminishing sequence φi = i −0.7 for both updating schemes.
While the informed proposal is powerful in accelerating the convergence of the chains, it also introduces extra computational costs since the posterior probabilities of all models in a neighbourhood are required. Given a k of size p k , the resulting neighbourhood N (γ, k) consists of 2 p k models. Although it is possible to speed up the posterior calculations using Gray codes as introduced in George and McCulloch (1997), evaluating 2 p k models is still computationally expensive when p k is very large and leads to an inefficient scheme. One way to address the issue is to tune the neighbourhood scaling parameter to generate neighbourhoods with a desired size, say let s be 5. In our experience, such control of the size of k comes at the cost of reduced exploration of the model space and the ARNI scheme fails to achieve better performance than ASI. This motivated us to develop a more efficient implementation of this approach that controls computational cost but maintains good exploration properties.

The PARNI sampler
We consider a point-wise implementation of the ARNI scheme (for short, the PARNI scheme). This approach is motivated by the block-wise implementation in Zanella (2020) and the block design strategy in Titsias and Yau (2017). The main idea is that the variables with neighbourhood indicator equal to 1 are divided into a series of smaller blocks and the new model is sequentially proposed by sequentially adding or deleting variables in each block. The block design can lead to a significant reduction in the total number of models considered and so require less computational effort. For instance, suppose that there are p k non-zero neighbourhood indicator variables, which are divided into m equally sized blocks. The neighbourhoods generated by each block will have 2 m models. Working through each block to propose a new state requires evaluating 2 m p k /m posterior probabilities. As the computational cost is proportional to the total number of models considered, the computational cost is largest when m = p k where the only building block is the entire neighbourhood of N (γ, k). In contrast, the smallest computational cost occurs when m = 1 where each block has one variable and therefore contains two models only. Throughout the section, we consider the latter block design when m = 1 and the resulting algorithm is the PARNI sampler.

Main Algorithm
We now formally present the PARNI algorithm and show how a new state γ is proposed from a current state γ. We use the same random neighbourhood construction as the ARNI scheme, in addition, the neighbourhood scaling parameter ξ is set to be fixed at 1 to indicate that neighbourhood sizes are not reduced at this stage. In other words the neighbourhoods are generated with the optimal values η opt as in (4). After a neighbourhood N (γ, k) is sampled, we sequentially propose a sequence of new models with only 1-Hamming distance differences inside N (γ, k). We define K = {K1, . . . , Kp k } = {j|kj = 1} to be the set of variables for which kj = 1 (the order of variables is random). We also define a sequence of models, γ(1), . . . , γ(p k ) and neighbourhoods, N (1), . . . , N (p k ) to select the final model. Let e(1), . . . , e(p) be the basis vector of a p-dimensional Cartesian space where e(j)j = 1 and e(j) j = 0 whenever j = j. We consider the neighbourhoods constructed according to γ(j) and e(Kr) for r from 1 to p k . The first neighbourhood is N (1) = N (γ, e(K1)) and propose a model γ(1) from where g is a balancing function and θ = (η opt , ω). We repeat this process to construct the second neighbourhood N (2) = N (γ(1), e(K2)) and propose the model γ(2) from N (2). In general, at time r, we defined N (r) = N (γ(r − 1), e(Kr)) and propose a model γ(r) from We call this a position-wise probability measure because it only allows to alter or preserve position Kr for each r. Fig. 1 provides a flowchart of the PARNI scheme which only involves enumerating at most 2p k models rather than 2 p k models in the ARNI proposal. The parameters of the proposal are θ = (η opt , ω).
To the construct a π-reversible chain, the probability of the reverse moves is required. These reverse moves use K = ρ(K) as their auxiliary variables. The mapping ρ reverses the order of elements in K so that the variable K contains the same elements in K but with reverse order. The typical benefit is that it leads to identical intermediate models of forward and reverse proposals and the posterior probabilities of p k models are required instead of 2p k . Suppose that γ (r) for r = 0, . . . , p k are consecutive intermediate models used in the reverse move and N (r) for r = 0, . . . , p k are the neighbourhoods used in the reverse move. These models and neighbourhoods are identical to models and neighbourhoods used in the proposal move but with opposite order, in particular γ (r) = γ(p k − r) for r = 0, . . . , p k and N (r) = N (p k − r + 1) for r = 1, . . . , p k . The second benefit is that the design leads to a reduced form of Metropolis-Hastings probability of acceptance. Let Z(r) be the normalising constant of the r-th position-wise probability measure, q PARNI θ,Kr (γ(r − 1), γ(r)), and Z (j) denote the normalising constant of r-th position-wise probability measure in reversed move, q PARNI θ,K r (γ (r − 1), γ (r)). We have that We let q PARNI θ,k (γ , γ) be the full proposal kernel that satisfies where γ(0) is current state γ and γ(p k ) is the final proposal γ . The Metropolis-Hastings acceptance probability of the PARNI proposal is given as The reduced form of the Metropolis-Hastings acceptance probability is illustrated by the following proposition: Proposition 3. Suppose γ, γ ∈ Γ are fixed. For any θ = (η, ω) ∈ ∆ 2p+1 and k such that γ ∈ N (γ, k), the Metropolis-Hastings acceptance probability in (30) can be written where Z(r), Z (r) for r = 1, · · · , p k are the normalising constants given in (28).

Adaptation schemes for algorithmic parameters
The last building block to complete the PARNI sampler is the adaptation mechanism of the tuning parameters. Suppose a total number of L are used. The posterior inclusion probabilities πj are updated as the ASI scheme in (5). The magnitude of the proposal thinning parameter ω is crucial in the mixing time and convergence rate of chains. Therefore, we consider two adaptation schemes for updating ω, the Robins-Monro adaptation scheme (RM) and the Kiefer-Wolfowitz adaptation scheme (KW). For the rest of section, we assume L multiple chains are used for the PARNI sampler. The Robbins-Monro adaptation scheme is widely used in updating tuning parameters of adaptive MCMC algorithms. Andrieu and Thoms (2008) review several adaptive MCMC algorithms using variants of the Robbins-Monro process. Given a specified probability of acceptance τ , the Robbins-Monro adaptation scheme automatically adjusts ω according to the comparison between the current probability of acceptance and τ . It is generally considered to be a robust adaption scheme. Given the acceptance probability of the l-th chain at the i-th iteration α l i , the tuning parameter ω is updated through the law The theoretical optimal value of τ may not exist for every candidate proposal kernel and choice of posterior distribution. We recommend using a target acceptance rate of 0.65 based on a large number of experiments that will be illustrated in Section 5 and using the diminishing sequence φi = i −0.7 . The Kiefer-Wolfowitz scheme is a stochastic approximation algorithm and modification of the Robbins-Monro scheme in which a finite difference approximation to the derivative is used. In this scheme the tuning parameter is updated to target the optimiser of an objective function of interest. Following Pasarica and Gelman (2010), we use the Kiefer-Wolfowitz scheme to adapt ω and use the expected squared jumping distance as the objective function. The expected squared jumping distance is closely linked to the mixing and convergence properties of a Markov chain. To estimate the expected squared jumping distance at each iteration, we use the average squared jumping distance as an estimator. An alternative objective function can be the generalised speed measure introduced in Titsias and Dellaportas (2019).
To estimate the finite difference approximation to the derivative of the average squared jumping distance, we exploit the multiple chain implementation of PARNI. The multiple independent chains naturally provide independent samples which fits the Kiefer-Wolfowitz approximation. Our implementation of the Kiefer-Wolfowitz adaption scheme proceeds as follows. We first evenly divide L multiple chains into two equally sized batches, L + and L − . Let ci be a diminishing sequence, new proposals are generated using ω + = ω (i) + ci for chains in L + and ω − = ω (i) − ci for chains in L − . The average squared jumping distances for these batches (i.e. ASJD +,(i) and ASJD −,(i) ) are estimated using the new proposals and their corresponding probabilities of acceptance. The tuning parameter ω is then updated according to the rule We suggest using ai = i −1 and ci = i −0.5 in the Kiefer-Wolfowitz scheme. Further details of the Kiefer-Wolfowitz adaption scheme are given in A.1 of the supplementary material and a feasibility analysis of the Kiefer-Wolfowitz adaption scheme is carried out in C.2 of the supplementary material.
Pseudocode of the PARNI sampler is given in Algorithm 3. The corresponding transition kernel is referred to as p PARNI-• θ for • = RM or KW. In the next section we show that the PARNI sampler is π-ergodic and satisfy a strong law of large numbers.

Ergodicity and Strong Law of Large Numbers
The multiple chain acceleration can be thought of as the realisation of L runs on a product space Γ ⊗L with joint variable γ ⊗L = (γ 1 , . . . , γ L ) ∈ Γ ⊗L . The number of independent chains L is satisfied L ≥ 1 if Robbins-Monro adaptation is used and L ≥ 2 if Kiefer-Wolfowitz adaptation is used. We consider a posterior distribution π on the space Γ which is of the form where both p(y|γ) and p(γ) are analytically available. In addition, the joint posterior distribution π ⊗L on the product set Γ ⊗L is given as In this section, the symbol • represents either KW or RM. The sub-proposal mass function of the PARNI − • sampler given neighbourhood indicator variable k and tuning parameter θ = (η, ω) is defined by The full transition kernel of the PARNI sampler is marginalised over all possible k where the sub-transition kernels given k are and α PARNI θ,k are Metropolis-Hastings acceptance rates in (31). The Markov chain transitional kernel that works on the product space Γ ⊗L is given as To establish the ergodicity and a SLLN of the PARNI sampler and its multiple chain acceleration, we require the following assumptions: (A.1) For some small constant ε > 0 and constant Cg, the balancing function g satisfies where ε < t1 < t2. This is a common condition for the proper choice of balancing functions. For example, Hastings' choice gH(t) = min{1, t} follows (40) immediately for Cg = 1 and Barker's choice gB = t/(1 + t) also follows (40) when Cg = 1 (i.e. the maximum derivative). (A. 2) The posterior distribution π is everywhere positive and bounded, that is, there exists a positive Π ∈ (1, ∞) such that 3) The tuning parameters θ (i) = (η (i) , ω (i) ) are bounded away from 0 and 1, and lie in the set for some small ∈ (0, 1/2).
The analysis of convergence and ergodicity often relies on the distribution of the Markov chain at time i along with its associated total variation distance · T V at an arbitrary starting point. Given {γ l,(i) } ∞ i=0 these are defined as We show here that the PARNI sampler is ergodic and satisfies a strong law of large numbers (SLLN). In mathematical terms for any starting point γ ⊗L ∈ Γ ⊗L and θ ∈ ∆ 2p+1 ergodicity means that for any l = 1, . . . , L, while a strong law of large numbers (SLLN) implies that almost surely, for any f : Γ → R. We first establish two technical results before presenting the main theorem of this section.
5 Numerical studies

Simulated data
We consider the data generation model introduced by Yang et al. (2016), and replicated in simulation studies conducted by  and Zanella and Roberts (2019). Suppose a linear model with n observations and p covariates is needed, data are generated from the model specification where ∼ Nn(0, σ 2 In) for pre-specified residual variance σ 2 and β * = SNR ×β (σ 2 log p)/n in which SNR represents the signal-to-noise ratios. Letβ = (2, −3, 2, 2, −3, 3, −2, 3, −2, 3, 0, · · · , 0) and each row of the design matrix X * i follow a multivariate normal distribution with mean zero and covariance Σ with entries Σjj = 1 for all j and Σij = 0.6 |i−j| for i = j. We consider four choices of SNR, namely 0.5, 1, 2 and 3, two choices of n, namely 500 and 1,000 and three choices of p, namely 500, 5,000 and 50,000.
We use the prior parameter values Vγ = Ip γ , g = 9 and h = 10/p.  give a detailed description of the resulting posterior distributions. In the presence of a low SNR (SNR = 0.5), there is too much noise to detect the true non-zero variables and the resulting posterior is rather flat, with no variables having posterior inclusion probabilities larger than 0.1. The posterior distributions are completely different when the SNR is large (SNR = 2 and SNR = 3). In these cases all of the true non-zero variables have inclusion probabilities close to 1 as the posterior distributions are more concentrated. In the intermediate case SNR = 1 slightly less than half of the true non-zero variables have inclusion probabilities above 0.8. In general the problem of finding the true non-zero variables becomes more difficult in the cases with lower SNR, smaller n and larger p.
We are interested in comparing the performance of the ASI and PARNI schemes relative to an Add-Delete-Swap sampler because the ASI scheme has been compared with several other state-of-the-art MCMC algorithms in . The adaptive algorithms are run with 25 multiple chains. In addition, to reduce the computational budget, all the adaptations terminate after the period of burn-in.
The choice of balancing function mainly focuses on three particular candidates: square root function gsq(t) = √ t, Hastings' choice gH(t) = min{1, t} and Barker's choice gB = t/(1 + t). The comparisons of these balancing functions in Supplement B.1.3 of Zanella (2020) illustrate two major findings. The Hastings' and Barker's choices only differ by at most a factor of 2 due to their similar asymptotic behaviors. The square root function mixes worsen outside the burn-in phase. Therefore, we consider the Hastings' choice throughout the section. Similar results are also expected for the Barker's choice.
Trace plots of chains are a straightforward way to visualise convergence. Fig. 2 are the trace plots of posterior model probabilities from the Add-Delete-Swap, ASI, PARNI-KW and PARNI-RM algorithms for the first 1,500 iterations when the SNR = 2. The Add-Delete-Swap scheme fails to converge for all choices of n and p and in particular becomes trapped at the empty model for a long period of time when p = 50, 000. The ASI scheme converges reasonably quickly when p is 500 or 5, 000, but takes longer to reach high probability regions when p = 50, 000. This suggests that ASI mixes worse and converges slower in high-dimensional datasets. On the other hand, both the PARNI-KW and PARNI-RM samplers mix rapidly in this setting.
The trace plots are not truly a fair comparison as they do not take into account running time. To better address the issue of computational efficiency we repeatedly ran all of the algorithms for 15 minutes and stored the estimates of posterior inclusion probabilities. We calculated mean squared errors of these estimates compared to "gold standard" estimates taken from a weighted tempered Gibbs sampler that has ran for any extremely long time. We show results in the form of performance relative to the Add-Delete-Swap scheme in Table 1. Smaller values always indicate better performance of the scheme. The value of -1 indicates the scheme yields 10 times smaller mean squared errors compared to those from the Add-Delete-Swap scheme in this specific dataset. Generally speaking, the mean squared errors for important variables are greater than those for important variables for almost every dataset and scheme. The choice of n does not significantly affect the performance of the samplers. Concentrating on the results for important variables, the ASI scheme leads to an order of magnitude improvement in efficiency over the Add-Delete-Swap sampler, which match the results in . The two PARNI algorithms with different adaptations lead to similar levels of accuracy and dominate both the ASI and Add-Delete-Swap schemes in every case when p > 500. In particular, the PARNI schemes result in roughly 10 5 times improvements over Add-Delete-Swap and 10 times improvements over ASI when p = 50, 000 and SNR= 2. On the other hand, the Add-Delete-Swap scheme is quite adept at removing the unimportant variables when the true model size is small compared to the number of covariates. When p = 50, 000 and SNR> 1 the ASI scheme struggles with unimportant variables and has yet to give estimates better than Add-Delete-Swap, but the PARNI algorithms produce better estimates even for these unimportant variables. Overall, the results suggest both PARNI samplers are more computationally efficient than alternatives when p is large. More results from simulated data are provided in Section C.3 of the supplementary material.

Real data
We consider eight real datasets implemented in , four of them with moderate p and four with larger p.
The first dataset is the Tecator dataset, which is previously analysed by Brown and Griffin (2010) in Bayesian linear regression and implemented by Lamnisos et al. (2013) and  in the context of Bayesian variable selection. It contains 172 observations and 100 explanatory variables. We also consider three small-p data sets constructed by Schäfer and Chopin (2013) to illustrate the performance of sequential Monte Carlo algorithms on Bayesian variable selection problems, the Boston Housing data (n = 506, p = 104), the Concrete data (n = 1030, p = 79) and the Protein data (n = 96, p = 88). These data sets are extended by squared and interaction terms which lead to high dependencies and multicollinearity.
The last four data sets are high-dimensional problems with very large-p. Three of them come from an experiment conducted by Lan et al. (2006) to examine the genetics of two inbred mouse populations. The experiment resulted in a set of data with 60 observations in total that were used to monitor the expression levels of 22, 575 genes of 31 female and 29 male mice. Bondell and Reich (2012) first considered this dataset in the context of variable selection. Three physiological phenotypes are also measured by quantitative real-time polymerase chain reaction (PCR), they are used as possible responses and are named PCRi for i = 1, 2, 3 respectively. For more details, see Lan et al. (2006); Bondell and Reich (2012). The last dataset concerns genome-wide mapping of a complex trait. The data are illustrated in Carbonetto et al. (2017). They are body and testis weight measurements recorded for 993 outbred mice, and genotypes at 79, 748 single nucleotide polymorphisms (SNPs) for the same mice. The main purpose of the study is to identify genetic variants contributing to variation in testis weight. Thus, we consider the testis weight as response, the body weight as a regressor that is always included in the model and variable selection is performed on the 79,748 SNPs.
Before analysing the performance of MCMC algorithms on the above datasets, it is worth discussing the selection of an optimal acceptance rate for the PARNI-RM sampler. The optimal scaling property of a Gaussian random walk proposal on some specific forms of target distribution is a well-studied problem. The most commonly used guideline is to seek an average acceptance rate of 0.234 (Gelman et al., 1997). The optimal acceptance rates for sophisticated informed proposals involving gradient information are typically larger, e.g. 0.57 for the Metropolis-adjusted Langevin algorithm (Grenander and Miller, 1994;Roberts and Rosenthal, 1998) and 0.65 for Hamiltonian Monte Carlo (Duane et al., 1987;Beskos et al., 2013). As our balanced random neighbourhood proposals can be viewed as a discrete analog to these gradient-based algorithms, it is natural to think that the PANRI sampler will have a larger optimal acceptance rate than a random walk Metropolis. To test this we ran the PARNI-RM scheme targeting different rates of acceptance on the above datasets. Fig. 3 shows the effect of the average acceptance rate on the expected squared jumping distance and average mean squared errors of the sampler. Parts (a) and (b) of the figure illustrate the relation between the thinning parameter ω and the average acceptance rate. Bigger values of ω are synonymous with larger jumps and therefore can lead to a smaller average acceptance rate. Parts (c) and (d) of the figure suggest that the maximum average squared jumping distance occurs when the acceptance rate is around 0.65 for all datasets. Parts (e) and (f) show that the average mean squared error is minimised when acceptance is around the same value. Therefore, we recommend targeting an average acceptance rate of 0.65 in most problems. Similar results for the simulated datasets of 5.1 are presented in C.1 of the supplementary material. We stress that the PARNI-KW scheme does not require a target acceptance rate to be chosen, so users who are uncomfortable with having to choose this quantity for a particular dataset are recommended to use this version of the sampler.
We consider a total of seven different MCMC schemes for these sets of data. In addition to the four schemes used in the simulation study (Add-Delete-Swap, ASI, PARNI-KW and PARNI-RM), we also implement 3 state-of-the-art algorithms, the Hamming ball sampler (HBS) with radius of 1 of Titsias and Yau (2017), and also both the tempered Gibbs sampler (TGS) and weighted tempered Gibbs sampler (WTGS) of Zanella and Roberts (2019). The prior specification for each dataset is given in Table 2. Fig. 4 shows trace plots of posterior model probabilities from the Add-Delete-Swap, ASI, PARNI-KW and PARNI-RM algorithms for the first 1,500 iterations in all eight real datasets. It is clear that the Add-Delete-Swap scheme does not mix well since it struggles to explore model space. All algorithms do reach high probability regions for datasets with moderate p in roughly the same number of iterations, however both the PARNI schemes appear to mix better than both Add-Delete-Swap and ASI and explore more models around the model space. In the large-p datasets, these algorithms lead to different behaviour. The Add-Delete-Swap scheme gets trapped in the empty model whereas the ASI algorithm does not converge properly for the first 1,500 iterations. The PARNI schemes, by contrast, accept almost every proposed states and mix very quickly. The figure suggests that both the PARNI schemes mix no worse than Add-Delete-Swap and ASI in moderate p problems, and better in large p problems.
We next turn attention to the average mean squared errors on these eight real datasets. The PARNI samplers do not dominate other schemes in moderate-p datasets, but they still lead to impressive results that are not close to optimal. The worst performance is for the Boston Housing and Concrete datasets, which are multi-modal and contain intricately correlated covariates. For large-p problems both the PARNI schemes significantly outperform other samplers. Surprisingly, the HBS and TGS schemes lead to worse estimates than Add-Delete-Swap. This can be explained by the computational cost per iteration of the HBS, TGS and WTGS algorithms, which is linear in p. These large computational costs combined with the issue of rarely exploring important variables lead to low efficiencies for HBS and TGS. The WTGS algorithm still outperforms TGS, which coincides with the conclusions gathered in Zanella and Roberts (2019) where the WTGS algorithm is shown to have smaller relaxation time than TGS. The ASI algorithm gives competitive estimates to WTGS in high-dimension but is eventually outperformed by the PARNI schemes. Among the PARNI schemes, the PARNI sampler with Kiefer-Wolfowitz adaption generally performs better than the Robbins-Monro version, but only by a small margin. This is due to the fact that the optimal acceptance rates are problem-specific and not exactly 0.65 for every dataset.

Discussion and future work
In this paper we present a framework for neighbourhood based MCMC algorithms, and propose a new scheme as an informed counterpart to the ASI algorithm in , using elements from locally balanced Metropolis-Hastings introduced in Zanella (2020). To address the expensive computational costs introduced by the informed proposal, we introduce two less computationally costly algorithms, the PARNI schemes, which can lead to a dramatic improvement in performance. The two PARNI schemes employ different adaptation schemes for the thinning parameter, the Kiefer-Wolfowitz and Robbins Moron schemes. The success of these new samplers is attributed to two aspects. Firstly the adaptation helps to explore the areas of interest (mainly with high posterior probabilities), and secondly the locally informed proposals are able to stabilise random walk behaviour in high-dimensions and lead to rapidly mixing samplers in practice. From the numerical studies on both simulated and real datasets, we recommend using a PARNI sampler with the Kiefer-Wolfowitz scheme for tackling high-dimensional (or large-p) Bayesian variable selection problems. We note that it can still be challenging for the PARNI samplers to move across low probabilistic regions, which could affect performance when the posterior has very isolated modes. This phenomenon is due to the fact that the PARNI samplers propose models sequentially where each sub-proposal can alter only 1 position at most. On the other hand, the original ARNI scheme can take larger jumps and is more able to explore well-separated modes, albeit with a substantial increase in computational costs. In summary, new schemes like PARNI show the potential of combining adaptive, random neighbourhood and informed proposals. We look forward to adding more theoretical support to the numerical evidence shown here in future work. In addition, the code to run the PARNI samplers and aforementioned numerical studies is given in https://github.com/XitongLiang/The-PARNI-scheme.git.
The paper provides many directions for extensions and future work. Some recent work has shed light on the issues of extra computational costs that come with informed proposals. Grathwohl et al. (2021) develop an accelerated locally informed proposal that uses derivatives with respect to the log mass functions. It is possible to derive the gradient of the posterior mass function with respect to γ with minor modifications on representations of the posterior distribution π(γ). To address the lack of mode jumping in the PARNI schemes, we can first try to construct larger blocks intelligently so that separated models are covered in one single block. This solution can be achieved by introducing basis vectors beyond the Cartesian case in the block construction. One can also use the sequential Monte Carlo methods of Ji and Schmidler (2013), Schäfer and Chopin (2013) and Ma (2015), which are more able to handle multimodality. Combining them with PARNI yields the chance of producing efficient methods on highly multimodal posterior distributions with well-separated modes. Another option in this direction is the JAMS algorithm of Pompe et al. (2020) that first locates each individual mode and then produces a mixture proposal that involves jumps within and between modes.
We intend to also study the performance of the PARNI schemes in generalised linear models as in Wan and  or a more flexible Bayesian variable selection model such as that suggested by Rossell and Rubio (2018). In these cases, regression coefficients and residual variance are no longer integrated out analytically and the likelihood of γ is not available in closed form. Informed proposals for such models are computationally challenging because the proposals involve the evaluations of these likelihood but the required approximations and estimates of the marginal likelihood are computationally intensive. One possible approach is the data-augmentation method using the Pólya-gamma distribution as described in Polson et al. (2013). The design does however require some care to avoid inefficiency causing by introducing a large number of auxiliary variables in large-n problems. We also believe that their is potential to use random neighbourhood samplers beyond variable selection, and aim to consider to other discrete-valued sampling problems in future work.

A.1 The Kiefer-Wolfowitz adaption scheme
The optimal scaling property of a Gaussian random walk proposal on some specific forms of target distribution is well-studied. The most commonly used way to achieve optimal mixing time is to tune scaling parameters, which leads to an average acceptance rate of 0.234. In practice, even in those cases where the posterior distribution does not strictly obey the assumptions, the average acceptance rate of 0.234 is often a suitable guide and results in good practical performance. For those proposals beyond random walks, the guidelines for optimal tuning are often unknown. Due to this fact, we develop an adaptation scheme which is able to adapt the tuning parameters in which the mixing time and convergence rate are optimised without knowing any theoretical results in advance.
We design a scheme that maximises the Expected Squared Jumping Distance (ESJD). The ESJD is an efficiency measure which accounts for the jumping distances between two consecutive states from a Markov chain which is highly related to first order autocorrelation (Pasarica and Gelman, 2010). Suppose that ω ∈ R is a continuous tuning parameter of a π-reversible transition kernel pω. The definition of ESJD given parameter ω is given as follows If pω is a Metropolis-Hastings transition kernel and can be decomposed into a product of a proposal kernel Qω with mass function qω and a term of the Metropolis-Hasting acceptance probability αω(γ, γ ), the definition of the ESJD above is equivalent to The ESJD is often infeasible to compute since it involves double sum over the sample space. To access the value of ESJD, we consider an estimator Average Squared Jumping Distance (ASJD) which depends on the past chain and ASJD is defined as follows: or alternatively where (γ (i) ) is the proposal of γ (i) through Qω. From above, the main advance of using ASJD is that ASJD can be easily estimated in each individual iteration. The objective is to locate the value of ω that leads to the largest ASJD. This is equivalent to solving the following optimisation problem of the tuning parameter If objective function ESJD(ω) is unimodal and smooth, ω * can be found by solving the first order ordinary differential equation The Robbins-Monro scheme can be applied here to adaptively update the optimal θ when those derivatives exist analytically. In most cases, however, the derivatives are not available analytically, which makes the Robbins-Monro scheme impossible to use. The Kiefer-Wolfowitz scheme (Kiefer and Wolfowitz, 1952), on the other hand, is an alternative to the Robbins-Monro algorithm where the derivatives are estimated using a finite difference method.
The following is how a Kiefer-Wolfowitz scheme proceeds. Let M (ω) be an objective function with a maximum θ * . If M (ω) is assumed to be unknown but some random observations M(ω) are given such that M (ω) = E[M(ω)], ω is updated following an iterative algorithm as follows where ai and ci are two diminishing sequences of forward positive step sizes and finite difference widths used respectively. In each iteration, we need two independent observations, M(ωi + ci) and M(ωi − ci), with tuning parameters ωi + ci and ωi − ci respectively. If the objective function M (ω) satisfies certain regularity conditions, it can be shown that ωi will converge to the optimal value ω * as n → ∞. Blum et al. (1954) show that this convergence is almost sure provided that some other conditions hold, most importantly that the diminishing sequences ai and ci satisfy We design an adaptive MCMC sampler which combines the Kiefer-Wolfowitz scheme and parallel chain implementation. The parallel chain implementation involves a number of independent chains which only share the same tuning parameters and provides independent observations as the Kiefer-Wolfowitz scheme requires. We consider a sampler which involves L parallel chains. In the sampler a new state is proposed through the kernel Qω, which is then accepted with probability αω. An adaptation scheme with the Kiefer-Wolfowitz updates is given as follows: compute ai = i −φa and ci = i −φc ; calculate ω + = ωi + ci and ω − = ωi − cn; separate the N chains into two groups, L − and L + ,; for each l ∈ L + , propose new state (γ (l,i) ) using Q ω + (γ (l,i) , ·), accept (γ (l,i) ) with probability α ω + (γ (l,i) , (γ (l,i) ) ; for each l ∈ L − , propose new state (γ (l,i) ) using Q ω − (γ (l,i) , ·), accept (γ (l,i) ) with probability α ω − (γ (l,i) , (γ (l,i) ) ); compute the ASJD for the current iteration by averaging over the chains in groups L + and L − respectively as follows: where • is either + or −; update the tuning parameter for the next iteration by
Then the algorithm can be viewed as a Markov chain on the larger state space (γ, γ , k), which alternates between the following steps: 1. Re-sample k, γ |γ from its conditional distribution with mass function p(k|γ)q k (γ, γ )
We then show q ω,k (γ, ·) is a probability measure on set N (γ, k) for any γ ∈ Γ and k ∈ K. Let J be a projection of k to a set J(k) consisting of the indices j for which kj = 1 (i.e. J(k) = {j|kj = 1}). Starting from equation (17) of Remark 2 together with the identity used above Γ p(dγ) = j Γ j pj(dγj), we obtain as required.

B.3 Proof of Theorem 1
Before proving Theorem 1, we first draw some conclusions on acceptance probability of the ASI and ARN proposals.
Proposition 4. Suppose γ is the current state, γ is the proposed state and η ∈ ∆ 2p is fixed parameter. For any k that constructs neighbourhood containing γ and γ and any choices of ξ ∈ ∆ and ω ∈ ∆ , the Metropolis-Hastings acceptance probability of the ARN proposal, α ARN (ξη,ω),k , as in (16) is fixed. In addition, this term is also the acceptance probability of the ASI proposal, i.e.
Proof of Proposition 4. Suppose that γ, γ ∈ Γ, η ∈ ∆ 2p are given and fixed. We consider all the k ∈ K such that γ ∈ N (γ, k). We are going to show that for any ξ and ω ∈ ∆ , the acceptance probability α ARN θ,k (γ, γ ) is free from the choice of k, ξ and ω.
To locate the different positions between γ and γ , we define the set J(γ, γ ) : where the last equality follows since q THIN ω,k is symmetric. Substituting p RN ξη yields The value of α ARN θ,k (γ, γ ) only depends on the choice of η and it does not involve the terms ξ, ω and k. Therefore, the proposition is proved. In addition, this is also the Metropolis-Hastings acceptance probability for ASI of proposing γ to γ following the definition of the ASI sampler. Now we formally prove Theorem 1.
Proof of Theorem 1. Rewriting the transition kernel of p ARN (ξη,ω) gives for which the last line follows from Proposition 4 and the fact that ζ = ξ × ω.
Recall the definitions of the conditional distribution of k in (14) and the within-neighbourhood proposal in (17). Together with η = (A, D), we have Let J(γ, γ ) be a set which consists of the indices j for which γj = γ j and J(γ, γ ) c be the complement to J(γ, γ ). By definition kj must be 1 when j ∈ J(γ, γ ) and kj can take values either 0 or 1 when j ∈ J(γ, γ ) c . Continuing from (B.4), we divide j = 1 to p into two groups, J(γ, γ ) and J(γ, γ ) c , and obtain We am going to further investigate the terms Ij for j ∈ J(γ, γ ) c . Clearly that, if γj = 1, then similarly that when γj = 0, we have Putting everything back to p ARN ξη,ω , and reconstructing the product in j from 1 to p gives Rewriting the above in terms of ζ = ξ × ω yields

B.4 Proof of Corollary 1
Proof of Corollary 1. It is clear that the argument holds from the Theorem 1 and its proof.
Most terms can be cancelled out and this leaves the first numerator and the last denominator I = π(γ) π(γ ) .
We now deal with the second term II. Substituting the values gives We know that the positions K1, . . . , Kp k are distinct and the vectors e(K1), . . . , e(Kp k ) can recover the auxiliary variable k. Therefore, we obtain Following the above arguments, the product of sequence j = 1, . . . , p k can be simplified The Metropolis-Hastings acceptance probability in (30) is · II · III From (B.5) and (B.6), we have · II = 1 and we therefore obtain as required.

B.6 Proof of Lemma 1
The proof of Lemma 1 is structured as proof of Lemma 1 in .
Proof of Lemma 1. We first introduce some preliminary work. Since θ = (η, ω) ∈ ∆ 2p+1 , we have for any k and γ. Similar arguments implies that (B.8) since p k ≤ p for all k. We also bound the Metropolis-Hastings acceptance probability in (30) from below where πm = minγ π(γ). Finally, we can chose b such that and therefore We say that Γ is (1, b, π(·))-small, and the simultaneous uniform ergodicity of the chain has been established following Theorem 8 of Roberts et al. (2004). For L multiple chains, the arguments are similar, but instead, the target distribution is now π ⊗L (γ ⊗L ) for γ ⊗L ∈ Γ ⊗L and Γ ⊗L is (1, b ⊗L , π ⊗L (·))-small for which

B.7 Proof of Lemma 2
Before proving the lemma, we require the following inequalities and its generalised version for which the last line follows from the triangle inequality.
We can generalise the above lemma and obtain the following.
Lemma 4. If aj, bj ≤ C for some constant C > 0, then for some constant C1. C1 can be chosen to be C p−1 . Proof.
As aj, bj ≤ C, aj/C and bj/C are smaller than 1.
The following lemma shows the diminishing rate of the proposal thinning parameter ω for both schemes (the PARNI-KW and PARNI-RM proposals).
Proof of Lemma 5. The update rule of (32) immediately leads to for λ = 0.7. For the , the values of tuning parameters ω adopted involve diminishing sequence ci. We are therefore interested in By applying the triangle inequality, we obtain Starting from the first term S1 and rearranging (A.9), we obtain where the second line follows from the fact that the expected jumping distances are bounded above by p.
Substituting the definition of ci into S2 yields Since both terms S1 and S2 are of the same order of O(i −0.5 ), the equation (B.22) is also O(i −0.5 ), which completes the proof.
We also require the following lemma to bound transition kernels by proposal kernels. The lemma and its proof is inspired by Lemma 4.21 in Łatuszyński et al. (2013).
Lemma 6. The sub-proposal kernel in (36) and sub-transition kernel in (38) obey the following relationship: for some constant C < ∞.
Proof of Lemma 6. Let the left-hand side and right-hand side of (B.23) be L and R respectively, namely Starting from the definition of sub-proposal kernel and substituting it into the left-hand side of (B.23), we obtain Starting with the term I and substituting in the definition of  gives Using | min{a, b} − min{c, d}| < |a − c| + |b − d| to further split I where the last line follows from the assumption (A.2). An analogous calculation gives which together with (B.27) implies that for any values of γ, γ , k and O, hence, it is enough to prove for C = (3 + 2Π) as required.
The proof of Lemma 2 is structured similarly to the proof of Lemma 2 in .
Proof of Lemma 2. The proof is structured as follows: firstly, we re-write the problem as a sum of subtransition kernels and bound the sub-transition kernels by sub-proposal kernels; secondly, we break sub-proposal kernels into various parts; lastly, we bound each part individually and hence bound the proposal kernels. Starting from the total variation norm in (46) and substituting in the definition of P PARNI θ in (37), we have for some constant C1 < ∞. Using Lemma 6, the problem is reduced to bounding the largest variations in two consecutive proposal kernels for some constant C2 < ∞. Plugging in the definition of ψ PARNI θ,k into (36), therefore, for some constant C3 < ∞. Applying Lemma 3 to (B.31), we obtain Now we move our attention to the next part of the proof where we are going to bound terms II and III respectively.
Starting with II, recall the definition of p RN η in (14) and η = (A, D), we have Following similar arguments to the proof of Lemma 2 of , we obtain From the definitions of Aj and Dj, we have (B.33) The pseudo-code of the PARNI sampler in (3) states thatπ are the Rao-Blackwellised estimates of posterior inclusion probabilities πj. Note that for all j from 1 to p, and therefore A similar conclusion can be drawn for each Dj, meaning for some constant C4 < ∞. The second term, III, is more complicated. We start by substituting sub proposal kernels in (27) to III yielding where γ(0) = γ and γ(p k ) = γ . Applying Lemma 3 to (B.36), we have The sub-proposal kernels typically contain two moves, either flipping position Kr or keeping it. Term IV is smaller than the maximum probability of these two moves. Let V be the absolute difference in flipping and VI be the absolute difference in keeping, then We next consider terms V and VI. Starting with V and substituting (27) .
Reduce the fractions to a common denominator to yield V = ω (i+1) g π(γ(r)) π(γ(r−1)) where γ d (r) = γ(r)K r − γ(r − 1)K r and the last line follows from (B.12) in the proof of Lemma 1 where all normalising constants can be bounded above and below. Using Lemma 4, we obtain :=IX for some constants C5, C6, C7 < ∞. We can apply similar arguments to VI giving for some constants C8, C9 < ∞. Starting with IX, by substituting in the definitions in (28), we have and applying the triangle inequality yields for some constants C10, C11 < ∞. The last line follows after applying Lemma 3. The diminishing adaptation of tuning parameter ω is shown in Lemma 5 where for some constant C12 < ∞ and λ = 0.5. We now consider term VIII. From assumption (A.1), we have and therefore for any t1, t2 > 0. We then have VIII ≤ Cg π(γ(r)) π(γ(r − 1)) where the last line follows from Assumption (A.2). Considering two possible values of γ d (r), namely 1 and −1, we show that VIII is bounded by the maximum of those values Next, multiplying the common denominator yields which holds because π0 ≤ Aj, Dj ≤ 1 from the proof of Lemma 1. Then, by applying Lemma 3, we have for some constant C13 < ∞.
As we have previously showed that for some constants C14, C15 < ∞.
We complete the proof by stating that IV ≤ C16i −λ for some constant C16 < ∞, and hence I ≤ C17i −λ for some constant C17 < ∞ and λ = 0.5, which shows that diminishing adaptation for the PARNI sampler has established.

B.8 Proof of Theorem 2
Proof. Simultaneous uniform ergodicity together with diminishing adaption are enough to show that π is stationary for each kernel P PARNI θ and the adaptive algorithm is ergodic from Theorem 1 in Roberts and Rosenthal (2007). Its multiple chain version is also ergodic with respect to π ⊗L .
The proof of the Strong Law of Large Numbers (SLLN) contains two steps. Firstly, we show that each individual chain satsifies a SLLN, that is Then by averaging over L parallel chains, we can show that the SLLN in (45) is satisfied for the multiple chain version. We use Theorem 2.7 in Fort et al. (2011) to show that (B.41) holds. To do so, three conditions are required.
(Con.1) The measurable function V can be chosen to be the constant function V ≡ 1, where V -variation distance norm reduces to the total variation distance. It is obvious that the below is met if with λ θ = 1/2, b θ = 1, the measure ν θ is uniformly distributed on the sample space Γ, that is ν θ (γ) = 1 2 p , with δ θ = b (the lower bound for a single chain in Lemma 1), then is satisfied.
(Con.2) Using the same parameters specified in (Con.1), the problem is reduced to This is satisfied since the PARNI sampler satisfies diminishing adaption by Lemma 2.

C Additional numerical results
This section will provide more numerical results in addition to Section 5.
C.1 Sensitivity analysis on thinning parameter ω for simulated datasets We repeat the experiment which studies the optimal value of ω and optimal average acceptance rate in Section 5.2 on simulated datasets used in Section 5.1. Fig. 5 shows the effect of manipulating the target average acceptance rate on average squared jumping distance and average mean squared errors. We split the plots into 4 sets where each set of graphs corresponds to a level of signal-to-noise ratio. Panels (a)-(d) show the negative relationship of ω against average acceptance rate. Panels (e)-(h) plot the average squared jumping distance against the average acceptance rate. Finally, panels (i)-(l) show the average acceptance rate and the average mean squared errors. These plots suggest the similar conclusion in Section 5.2 for which the average acceptance rate of 0.65 yields the largest average squared jumping distance. The smallest average mean squared errors are also located around the region of 0.65 average acceptance rate.

C.2 Additional results from Kiefer-Wolfowitz adaptation scheme
This section is to examine whether applying the Kiefer-Wolfowitz adaptation scheme to the PARNI sampler would lead to the optimal scaling property of the chains. We repeatedly ran the PARNI-KW sampler for 1,500 iterations with 3 different initial values on the 24 simulated datasets of Section 5.1 and the 8 real datasets of Section 5.2 and recorded the values of ω. The trace plots of these ω values are given in Figs. 6, 7, 8 and 9 for the simulated datasets and Fig. 10 for the real datasets. The black horizontal lines in these plots indicate the empirical optimal values of ω gathered for each dataset from Fig. 3 and 5. The optimal values decrease along with the dimensionality of p and they are also influenced by the correlation structure for which more complicated correlation structures imply smaller values of ω. It appears that the values of ω are moving towards the black lines and converging to them regardless of initial values. There is a significant trend that the ω values are approaching the region around the optimal values fairly quickly. Surprisingly, the parameter ω converges even faster in high-dimensional problems, for example, when p = 50, 000 in simulated datasets and the SNP dataset. But there is still a rare chance that the Kiefer-Wolfowitz scheme does not lead to the optimal choice. Some of ω values become trapped on the value of 1. This issue is mainly caused by two reasons. Firstly, the ASJD estimators are often too noisy to capture the true signal in the expected squared jumping distances. These estimators can be viewed as simple Monte Carlo with only a few samples and therefore we may obtain estimates with extremely large estimation errors. Large errors introduce uncertainty into the true direction that should be updated and make ω take longer to converge or converge to a suboptimal value. Secondly, the use of the logistic transformation makes ω difficult to be updated in the boundary areas and therefore ω easily get trapped in values of 0 or 1.
Overall, the Kiefer-Wolfowitz adaptation scheme is relatively robust in tuning ω for the PARNI sampler, and we believe it can also be applied to other adaptive MCMC schemes in tuning the scaling parameters.

C.3 More results from simulated datasets
In addition to Fig. 2, Figs. 11, 12 and 13 are trace plots of log posterior model probabilities from the Add-Delete-Swap, ASI, PARNI-RM and PARNI-KW schemes on the simulated datasets of Section 5.1 when SNR = 0.5, 1 and 3. Generally speaking, the PARNI-RM and PARNI-KW algorithms mix better than the Add-Delete-Swap and ASI schemes on all datasets. Except for the datasets for which the posterior distributions do not concentrate in a few models (when SNR = 0.5), the Add-Delete-Swap scheme always get struck on the empty model and struggles to include important variables and reach the high probability region within the first 1,500 iterations. The ASI algorithm mixes quite well when p is relative small, but this algorithm is taking longer to converge and it is inefficient to jump between different models when p reaches 50, 000. On the other hand, the PARNI-RM and PARNI-KW samplers only take dozens of iterations to converge properly in all settings. In conclusion, the plots suggest that both the PARNI schemes outperform Add-Delete-Swap and ASI in terms of the mixing time and convergence rate on the simulated datasets. Simulated data: plots of average squared jumping distance and average mean square error against average acceptance rate and ω. (a) -(d) average acceptance rate against ω for simulated datasets with signal-to-noise ratio of 0.5, 1, 2 and 3; (e) -(h) average squared jumping distance against average acceptance rate for simulated datasets with signal-to-noise ratio of 0.5, 1, 2, 3; (i) -(j) average mean squared error against average acceptance rate for simulated datasets with signal-to-noise ratio of 0.5, 1, 2 and 3. They always propose models with high probability of being accepted and therefore sufficiently explore the sample space.