1 Introduction

Despite their long history, linear regression models remain a key building block of many present-day statistical analyses. In the modern setting, practitioners are interested not only in making good predictions but also in investigating underlying low-dimensional structure, based on the belief that only a small subset of predictors plays a crucial role in predicting the response. These problems can be addressed by variable selection. A variable selection method is an automatic procedure that selects the best (small) subset of covariates that explains most of the variation in the response (Chipman et al. 2001). Frequentist approaches focus on model comparisons through information criteria or point estimates, using e.g. maximum penalised likelihood under sparsity assumptions (Hastie et al. 2015). Alternatively, a Bayesian approach can be taken by imposing an appropriate prior on all possible models and computing the posterior.

We consider Bayesian variable selection (BVS) with spike-and-slab priors (Mitchell and Beauchamp 1988), which lead to natural uncertainty measures such as posterior model probabilities and marginal posterior variable inclusion probabilities. Given a linear regression model with p candidate covariates, we focus on a random variable \(\gamma \in \Gamma = \{0,1\}^p\) where \(\gamma _j=1\) indicates that the j-th covariate is included in the model. The exact posterior distribution of \(\gamma \) is challenging to compute, and when \(p>30\) Markov chain Monte Carlo (MCMC) algorithms are typically used to estimate posterior summaries of interest (George and McCulloch 1993; Chipman et al. 2001). Garcia-Donato and Martinez-Beneito (2013) discuss the use of the Gibbs sampler whereas Madigan et al. (1995) (\(\text {MC}^3\)) and Brown et al. (1998) (Add-Delete-Swap) propose random-walk Metropolis-Hastings algorithms. Yang et al. (2016) show that, under some mild conditions on the posterior distribution, the Add-Delete-Swap algorithm is rapidly mixing in the sense that its mixing time grows at most polynomially in p. These approaches can, however, suffer from an unexpectedly long mixing time and therefore slow convergence when p is large. For this reason, alternative informed MCMC schemes have gained popularity for problems with discrete parameter spaces (having already achieved prominence in the continuous setting). Informed MCMC schemes are those in which the Metropolis-Hastings proposal exploits some information about the target distribution. Intuitively, the success of informed proposals relies on avoiding models with low posterior model probabilities (Zhou et al. 2021). Titsias and Yau (2017) describe the Hamming ball sampler (HBS) in which models are proposed in proportion to their locally-truncated posterior probability within a Hamming ball neighbourhood. Zanella and Roberts (2019) consider a Tempered Gibbs sampler (TGS), which involves importance sampling and more frequently updates components whose current values have low conditional probability. A more general class of locally informed and balanced proposals is introduced by Zanella (2020). These locally balanced proposals can be obtained by weighting a base kernel using a balancing function, which is a function of the posterior distribution that satisfies a certain functional property. The base kernel is typically concentrated on a neighbourhood of the current state, resulting in a proposal that is informed and balanced using “local” information about the posterior. The author shows that, under mild conditions on the target distribution, a random walk proposal is asymptotically dominated by its locally balanced counterpart in the Peskun sense as dimensionality increases (Peskun 1973; Tierney 1998). Zhou et al. (2021) present a Locally Informed and Thresholded (LIT) proposal which replaces the balancing function by a thresholded weighting function (i.e. a thresholding function). The LIT scheme is closely connected to the locally balanced proposal because the thresholding function behaves like a flexible composition of globally and locally balanced functions. This scheme has been shown to have a dimension-free mixing time bound under conditions similar to those in Yang et al. (2016). For other developments concerning locally informed proposals, see e.g. Livingstone and Zanella (2019); Gagnon (2021); Power and Goldman (2019).

Since the posterior distribution is discrete-valued, the above random-walk or informed MCMC schemes can be viewed as neighbourhood samplers. A neighbourhood sampler is an MCMC scheme which can be decomposed into two stages: (i) construct a neighbourhood, a set of states (models) around the current state (model); (ii) propose a new state (model) within the neighbourhood constructed in stage (i). For example, the \(\text {MC}^3\) and locally balanced schemes propose a new model on an identical neighbourhood consisting of models which differ from the current model in only 1 position (i.e. a Hamming neighbourhood), whereas their second stages are a random walk and an informed proposal respectively. The LIT algorithm of Zhou et al. (2021) is similar to the locally balanced scheme, but it uses the same neighbourhood generation mechanism as an Add-Delete-Swap scheme and its second stage uses a thresholding function. The design of the neighbourhoods is a crucial factor in the performance of MCMC schemes, particularly informed schemes, for two major reasons. The first is the “quality” of the neighbourhood, in the sense that we should generate neighbourhoods containing many promising models. Better quality neighbourhood construction will improve the mixing of the chain and prevent it getting stuck in low-probability models. The second is the size of the neighbourhood. Informed MCMC schemes often mix quickly and have good convergence properties, but the computation of each transition can be prohibitively expensive. For example, the neighbourhood used by the locally balanced proposal contains a number of models that is at least linear in p and, under standard sparsity assumptions, will tend to include large numbers of unimportant variables. Neighbourhoods have also been considered previously in the context of stochastic search. Hans et al. (2007) describe a novel Shotgun Stochastic Search (SSS) algorithm whilst Chen et al. (2016) consider a paired-move multiple-try stochastic search algorithm. Both schemes identify a subset of probable models and move to new models within the neighbourhood according to posterior model probabilities.

In this paper we propose a method which generates good neighbourhoods while controlling computational cost when p is large, by introducing a framework for constructing flexible and efficient MCMC algorithms based on random neighbourhoods. We refer to such a scheme as a random neighbourhood sampler and show that, when well-constructed, these schemes can lead to Markov chains with good convergence properties and controlled computational cost per iteration. Our method uses an adaptive scheme to achieve a flexible neighbourhood generating mechanism. Adaptive MCMC is a sub-class of algorithms in which tuning parameters are automatically updated “on the fly” (e.g. Andrieu and Thoms 2008). Several adaptive methods have been developed in the context of BVS (Ji and Schmidler 2013; Lamnisos et al. 2009, 2013). We build on Griffin et al. (2021) who develop the Adaptively-Scaled Individual Adaptation sampler (ASI), which is able to adapt to the importance of each candidate covariate and propose multiple swaps per iteration in high-dimensional settings. We show in this paper that the ASI algorithm is a random neighbourhood sampler whose second stage is a random-walk proposal. Based on this observation, we design a random neighbourhood informed sampler with the same neighbourhood generating mechanism as ASI, but with its second stage replaced by an informed within-neighbourhood proposal. To illustrate the power of the framework, we develop a new MCMC algorithm for Bayesian variable selection in linear regression, namely the Point-wise Adaptive Random Neighbourhood Informed (PARNI) sampler. This combines the strengths of ASI for good neighbourhood generation and locally informed proposals for avoiding random walk behaviour. An extensive set of empirical results on both real and simulated data-sets shows that the PARNI sampler yields good estimates for posterior quantities of interest and performs particularly well for well-known large-p examples such as the PCR (\(p=22,575\)) and SNP (\(p=79,748\)) data-sets.

The rest of this paper is structured as follows. In Sect. 2, we review BVS for the linear model along with prior specification. We also briefly describe both the ASI scheme of Griffin et al. (2021) and the locally informed methods of Zanella (2020) and Zhou et al. (2021). In Sect. 3, we characterise the construction of random neighbourhood proposals and illustrate that locally informed proposals and the ASI scheme fall within this framework. Section 4 presents the construction of adaptive random neighbourhood and informed samplers. Following this structure, we present the ARNI and PARNI samplers. In addition, we establish both the ergodicity and a strong law of large numbers for the PARNI algorithm. We implement the PARNI sampler in Sect. 5 on both simulated and real data. Comparisons between the PARNI samplers and other state-of-the-art MCMC algorithms are carried out to showcase their capacity and efficiency. In Sect. 6 we discuss limitations and possible future work. Detailed explanations and proofs are provided in the supplement.

2 Background

2.1 Bayesian variable selection for the linear regression model

Consider a data-set \(\{(y_i,x_{i1},...,x_{ip})\}_{i=1}^n\), where the vector \(y = (y_1,...,y_n) \in \mathbb {R}^n\) is called the response variable and each \(x_j = (x_{1j},...,x_{nj})\) is one of p predictor variables or covariates. The variable selection problem is concerned with finding the best \(q \ll p\) covariates that are most associated with the response. Assuming that each regression includes an intercept, there are \(2^p\) possible models that can be formulated to predict the response. We refer to each model as \(M_{\gamma }\), indexed by the indicator variable \(\gamma = (\gamma _1, \ldots , \gamma _p) \in \Gamma = \{0,1\}^p\), where \(\gamma _j = 1\) if the j-th variable is included in model \(M_\gamma \) and \(\gamma _j = 0\) otherwise. We refer to \(\Gamma \) as the model space and let \(p_\gamma := \sum _j \gamma _j\). The model \(M_\gamma \) associated with \(\gamma \) is then

$$\begin{aligned} y = \alpha {\textbf {1}}_n + X_\gamma \beta _\gamma + \epsilon \end{aligned}$$
(1)

where \(\epsilon \sim N_n(0, \sigma ^2 I_n)\), y is an n-dimensional response vector, \(X_\gamma \) is an \((n \times p_\gamma )\) design matrix which consists of the “active” variables in \(\gamma \) (those for which \(\gamma _j = 1\)), \(\alpha \) is an intercept term and \(\beta _\gamma \in \mathbb {R}^{p_\gamma }\). In the Bayesian framework, we consider a commonly-used conjugate prior specification

$$\begin{aligned}&p(\alpha ) \propto 1, \quad \beta _\gamma |\gamma , \sigma ^2 \sim N(0, g \sigma ^2 V_\gamma ), \quad p(\sigma ^2) \propto \sigma ^{-2}, \quad \\&\quad p(\gamma ) = h^{p_{\gamma }} (1-h)^{p-p_{\gamma }}. \end{aligned}$$

For simplicity, we can remove the intercept term \(\alpha \) by centering y and \(X_j\) for all j. Chipman et al. (2001) highlight that this method can be motivated from a formal Bayesian perspective by integrating out the coefficients corresponding to those fixed regressors with respect to an improper uniform prior. The covariance matrix \(V_\gamma \) is often chosen as \((X_\gamma ^T X_\gamma )^{-1}\) (a g-prior) or the identity matrix \(I_{p_\gamma }\) (an independence prior). In what follows, we will focus on the independence prior where \(V_\gamma = I_{p_\gamma }\). For both of these choices, the marginal likelihood \(p(y|\gamma )\) is analytically tractable. Suitable values for the global scale parameter g are suggested in Fernandez et al. (2001). Alternatively, g can be given a hyperprior, yielding a fully Bayesian model (see Liang et al. (2008) for details). The hyperparameter \(h \in (0,1)\) is the prior probability that each variable is included in the model. Steel and Ley (2007) advise against using a fixed h unless strong prior information is available, and instead suggest placing a hyperprior on it, such as a Beta prior \(h \sim \text {Beta}(a,b)\), leading to a Beta-binomial prior on the model size. The choices of g and h will be specified later for each data-set. In the following sections, we will develop efficient sampling schemes targeting the posterior distribution \(\pi (\gamma ) \propto p(y|\gamma )p(\gamma )\).
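To make the marginal likelihood concrete, the sketch below (our own illustrative Python code, with hypothetical function names) evaluates \(\log p(y|\gamma )\) under the independence prior, up to an additive constant not depending on \(\gamma \), using the standard closed form obtained after centering y and the columns of X and integrating out \(\beta _\gamma \) and \(\sigma ^2\); this is the quantity that the samplers discussed below evaluate repeatedly.

```python
import numpy as np

def log_marginal_likelihood(gamma, y, X, g):
    """log p(y | gamma) under the independence prior V_gamma = I, up to an
    additive constant not depending on gamma. Assumes y and the columns of
    X have already been centred."""
    n = len(y)
    X_g = X[:, gamma.astype(bool)]            # design matrix of "active" variables
    p_g = X_g.shape[1]
    yty = y @ y
    if p_g == 0:
        return -0.5 * (n - 1) * np.log(yty)
    A = X_g.T @ X_g + np.eye(p_g) / g         # X_g' X_g + g^{-1} I
    L = np.linalg.cholesky(A)
    b = np.linalg.solve(L, X_g.T @ y)
    rss = yty - b @ b                         # y'y - y' X_g A^{-1} X_g' y
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return (-0.5 * p_g * np.log(g) - 0.5 * logdet
            - 0.5 * (n - 1) * np.log(rss))

def log_posterior(gamma, y, X, g, h):
    """log pi(gamma) up to a constant, combining the marginal likelihood
    with the Bernoulli model prior p(gamma) = h^{p_gamma} (1-h)^{p-p_gamma}."""
    p_g = gamma.sum()
    return (log_marginal_likelihood(gamma, y, X, g)
            + p_g * np.log(h) + (len(gamma) - p_g) * np.log(1 - h))
```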

Remark 1

For a linear regression model with p candidate covariates, it has been shown that spike-and-slab priors often lead to posterior consistency in the sense that the posterior collapses to a Dirac measure on the true model as more observations are gathered (Fernandez et al. 2001; Liang et al. 2008; Yang et al. 2016), even in high-dimensional settings where p grows with n (Shang and Clayton 2011; Narisetty and He 2014). Another approach is to employ continuous shrinkage priors (e.g. Polson and Scott 2010; Griffin and Brown 2021), which only provide posterior inference on the regression coefficients but can result in a more computationally tractable posterior distribution.

2.2 Adaptively scaled individual adaptation algorithm

Griffin et al. (2021) introduce a scalable adaptive MCMC algorithm targeting high-dimensional BVS posterior distributions together with a method that automatically updates the tuning parameters. They consider the class of proposal kernels

$$\begin{aligned} q_{\eta } (\gamma , \gamma ^\prime ) = \prod _{j=1}^p q_{\eta , j}(\gamma _j, \gamma _j^\prime ) \end{aligned}$$
(2)

where \(\eta = (A, D) = (A_1, \ldots , A_p, D_1, \ldots , D_p)\), \(q_{\eta , j}(\gamma _j=0, \gamma _j^\prime =1) = A_j\) and \(q_{\eta , j}(\gamma _j = 1, \gamma _j^\prime = 0) = D_j\), with Metropolis-Hastings acceptance probability

$$\begin{aligned} \alpha _\eta (\gamma , \gamma ^\prime ) = \min \left\{ 1, \frac{\pi (\gamma ^\prime )q_{\eta }(\gamma ^\prime , \gamma )}{\pi (\gamma )q_{\eta }(\gamma , \gamma ^\prime )}\right\} . \end{aligned}$$
(3)

This proposal mainly benefits from two aspects. Firstly, the flexibility offered by 2p tuning parameters allows the proposal to be tailored to the data. Secondly, this form of proposal also allows multiple variables to be added or deleted from the model in a single iteration, which in turn allows the algorithm to make large jumps in model space.
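As a minimal sketch (our own code, not the authors' implementation), one draw from the proposal (2) and the log proposal ratio appearing in (3) can be computed as follows; here `A` and `D` hold the per-variable addition and deletion probabilities (for instance the scaled values \(\zeta A^{\text {opt}}_j\) and \(\zeta D^{\text {opt}}_j\) introduced below).

```python
import numpy as np

def asi_propose(gamma, A, D, rng):
    """One draw from the product proposal (2): position j is flipped to 1
    with probability A_j when gamma_j = 0, and to 0 with probability D_j
    when gamma_j = 1."""
    flip_prob = np.where(gamma == 0, A, D)
    flips = rng.random(len(gamma)) < flip_prob
    return np.where(flips, 1 - gamma, gamma)

def asi_log_proposal_ratio(gamma, gamma_new, A, D):
    """log q(gamma', gamma) - log q(gamma, gamma'), as needed in (3)."""
    changed = gamma != gamma_new
    fwd = np.where(gamma == 0, A, D)          # flip probabilities from gamma
    rev = np.where(gamma_new == 0, A, D)      # flip probabilities from gamma'
    log_fwd = np.sum(np.where(changed, np.log(fwd), np.log(1 - fwd)))
    log_rev = np.sum(np.where(changed, np.log(rev), np.log(1 - rev)))
    return log_rev - log_fwd
```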

Algorithm 1 The Adaptively Scaled Individual Adaptation (ASI) sampler

Griffin et al. (2021) suggest an optimal choice of \(\eta = (A, D)\) in the Peskun sense under the assumption that all variables are independent. If \(\pi _j\) denotes the posterior inclusion probability of the j-th regressor, the optimal choice \(\eta ^{\text {opt}} = (A^{\text {opt}}, D^{\text {opt}})\) is given by

$$\begin{aligned} A^{\text {opt}}_j = \min \left\{ 1, \frac{\pi _j}{1-\pi _j}\right\} , \quad D^{\text {opt}}_j = \min \left\{ 1, \frac{1-\pi _j}{\pi _j}\right\} . \end{aligned}$$
(4)

The independence assumption is usually violated due to correlation between regressors, and therefore a scaled proposal with parameters \(\eta = \zeta \eta ^{\text {opt}}\) for a scaling parameter \(\zeta \in (0,1)\) is suggested. This scaling parameter \(\zeta \) controls the number of variables that differ between the current state \(\gamma \) and the proposed state \(\gamma '\). Smaller values of \(\zeta \) can be used to avoid overly ambitious moves with low probabilities of acceptance and so control the average acceptance rate. They also suggest multiple chain acceleration with common adaptive parameters, since running multiple independent chains with shared adaptive parameters can facilitate the convergence of the adaptive parameters (Craiu et al. 2009). This phenomenon is demonstrated in their simulation studies, where schemes with 25 parallel chains outperform schemes with only 5 chains in terms of relative efficiency, especially for large-p data-sets. Suppose L chains are used and let \(\gamma ^{l,(i)}\) and \(\gamma ^{l,\prime }\) denote the current state and proposal for the l-th chain respectively. We also define the vector \(\gamma _{-j} = (\gamma _1, \ldots ,\gamma _{j-1}, \gamma _{j+1},\ldots ,\gamma _{p})\), that is \(\gamma \) with \(\gamma _j\) removed. The tuning parameters of the proposal are updated on the fly using a Rao-Blackwellised estimate of the posterior inclusion probability of the j-th regressor which, at the N-th iteration, is

$$\begin{aligned} {\hat{\pi }}^{(N)}_j = \frac{1}{NL} \sum _{i=1}^N \sum _{l=1}^L \frac{\pi (\gamma _j = 1, \gamma ^{l,(i)}_{-j}|y)}{\pi (\gamma _j = 1, \gamma ^{l,(i)}_{-j}|y) + \pi (\gamma _j = 0, \gamma ^{l,(i)}_{-j}|y)}. \end{aligned}$$
(5)
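A direct (non-optimised) illustration of (5) is sketched below: it simply evaluates the log posterior with \(\gamma _j\) set to 1 and to 0 for each j, so it costs p full posterior evaluations per state rather than the \(\mathcal {O}(p)\) scheme of Griffin et al. (2021) described next; `log_pi` is a hypothetical callable returning \(\log \pi (\gamma )\) up to a constant (for instance the `log_posterior` sketch from Sect. 2.1).

```python
import numpy as np

def conditional_pips(gamma, log_pi):
    """pi(gamma_j = 1 | gamma_{-j}, y) for every j; averaging these values
    over iterations and chains gives the Rao-Blackwellised estimate (5)."""
    p = len(gamma)
    pips = np.empty(p)
    for j in range(p):
        g1, g0 = gamma.copy(), gamma.copy()
        g1[j], g0[j] = 1, 0
        pips[j] = 1.0 / (1.0 + np.exp(log_pi(g0) - log_pi(g1)))
    return pips

# running average over iterations and chains at iteration N:
# pi_hat = ((N - 1) * pi_hat + np.mean([conditional_pips(gam, log_pi)
#                                       for gam in chain_states], axis=0)) / N
```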

Rao-Blackwellised estimates of the posterior inclusion probabilities can swiftly distinguish unimportant variables. Griffin et al. (2021) show how these Rao-Blackwellised estimates can be calculated in \(\mathcal {O}(p)\) operations, which leads to a scalable MCMC scheme in large-p BVS problems. At the i-th iteration, the proposal parameters are \(\eta = \zeta ^{(i)} \times \eta ^{(i)}\) where \(\eta ^{(i)} = (A^{(i)}, D^{(i)})\),

$$\begin{aligned} A^{(i)}_j = \min \left\{ 1, \frac{{\hat{\pi }}^{(i)}_j}{1-{\hat{\pi }}^{(i)}_j}\right\} , \quad D^{(i)}_j = \min \left\{ 1, \frac{1-{\hat{\pi }}^{(i)}_j}{{\hat{\pi }}^{(i)}_j}\right\} \end{aligned}$$
(6)

and the scaling parameter \(\zeta ^{(i)}\) is tuned using the Robbins-Monro scheme

$$\begin{aligned} \text {logit}_\epsilon \zeta ^{(i+1)} = \text {logit}_\epsilon \zeta ^{(i)} + \frac{\phi _i}{L} \sum _{l=1}^L (\alpha _{\zeta ^{(i)} \eta ^{(i)}}(\gamma ^{l,(i)},\gamma ^{l,\prime }) - \tau ) \end{aligned}$$
(7)

for a target rate of acceptance \(\tau \), where the mapping \(\text {logit}_\epsilon :(\epsilon , 1-\epsilon ) \rightarrow \mathbb {R}\) is a modified logit function defined by

$$\begin{aligned} \text {logit}_\epsilon (x) = \log (x-\epsilon ) - \log (1-x-\epsilon ) \end{aligned}$$
(8)

for some small \(\epsilon \in (0, 1/2)\). The full description of the sampler is given in Algorithm 1. The resulting algorithm is called Adaptively Scaled Individual Adaptation (ASI). Griffin et al. (2021) establish the \(\pi \)-ergodicity and a strong law of large numbers for the ASI sampler.
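The ε-modified logit map (8) and the Robbins-Monro update (7) are straightforward to implement; the sketch below (our own code) updates ζ from the acceptance probabilities of the L chains at iteration i.

```python
import numpy as np

def logit_eps(x, eps):
    """Modified logit map (eps, 1 - eps) -> R, as in (8)."""
    return np.log(x - eps) - np.log(1.0 - x - eps)

def inv_logit_eps(z, eps):
    """Inverse of logit_eps, mapping R back to (eps, 1 - eps)."""
    return eps + (1.0 - 2.0 * eps) / (1.0 + np.exp(-z))

def update_zeta(zeta, acc_probs, i, eps, tau=0.234, lam=0.7):
    """One Robbins-Monro step (7); acc_probs holds the L acceptance
    probabilities at iteration i (1-indexed)."""
    phi_i = i ** (-lam)
    z = logit_eps(zeta, eps) + phi_i * (np.mean(acc_probs) - tau)
    return inv_logit_eps(z, eps)
```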

Remark 2

The performance of the ASI algorithm is crucially related to the choice of appropriate values of parameters and hyperparameters. The parameters \(\eta ^{(i)}\) and \(\zeta ^{(i)}\) are updated on the fly. The hyperparameters are chosen as follows: \(\phi _i = i^{-0.7}\), \(\tau = 0.234\), \(\epsilon = 0.1/p\) and \(\pi _0 = 0.001\). This hyperparameter specification is suggested by Griffin et al. (2021), who show that it works well in general based on empirical performance. See their paper for further discussion of the choice of hyperparameters.

2.3 Locally informed proposals for discrete-valued variables

On continuous sample spaces, MCMC algorithms often utilise gradients of the target distribution, e.g. the Metropolis-adjusted Langevin algorithm (Grenander and Miller 1994) and Hamiltonian Monte Carlo (Duane et al. 1987). These methods are defined on continuous spaces, but Zanella (2020) develops a class of informed proposals as an analogue for discrete spaces. The approach assumes that we can define a random walk Metropolis proposal kernel Q on a neighbourhood \(N \subset \Gamma \) with mass function q. In this paper, we consider the following construction of informed proposals described by Zanella (2020):

$$\begin{aligned} q_g(\gamma , \gamma ^\prime ) = {\left\{ \begin{array}{ll} \frac{g\left( \frac{\pi (\gamma ^\prime )}{\pi (\gamma )}\right) q(\gamma , \gamma ^\prime )}{Z_g(\gamma )}, \quad &{} \gamma ^\prime \in N \\ 0, \quad &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(9)

where \(g:[0,\infty )\rightarrow [0,\infty )\) is a monotone continuous weighting function and \(Z_g(\gamma )\) is a normalising constant such that

$$\begin{aligned} Z_g(\gamma ) = \sum _{\gamma ^\prime \in N} g\left( \frac{\pi (\gamma ^\prime )}{\pi (\gamma )}\right) q(\gamma , \gamma ^\prime ). \end{aligned}$$
(10)

The choice of the weighting function g is crucial for the performance of \(Q_g\) since it determines how the target distribution \(\pi \) drives the proposal. When g is the constant function \(g(t) = 1\), the resulting informed proposal \(Q_g\) coincides with the base kernel Q and is referred to as a non-informed proposal. Zanella (2020) mainly discusses locally balanced proposals, which are formed using balancing functions that satisfy \(g(t) = tg(1/t)\) for all \(t > 0\). Locally balanced proposals are approximately \(\pi \)-reversible if Q is restricted to local moves. The neighbourhoods are normally chosen to be \(N = \mathcal {H}_m(\gamma ) := \{\gamma ^\prime \in \Gamma | d_H(\gamma ^\prime , \gamma ) \le m\}\), where \(d_H(\cdot ,\cdot )\) denotes the Hamming distance (i.e. \(d_H(\gamma , \gamma ^\prime ) = \sum _{j=1}^p |\gamma _j - \gamma _j^\prime |\)), and the base kernel Q is a uniform distribution on the neighbourhood N. When m is taken to be 1, the base kernel Q is identical to the \(\text {MC}^3\) proposal. In addition, taking g as the identity function (i.e. \(g(t) = t\)) leads to a globally balanced proposal \(Q_g\), which is \(\pi \)-reversible when the neighbourhood N is the whole sample space.
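For concreteness, the sketch below (our own code) forms the locally balanced proposal (9) on the 1-Hamming neighbourhood with a uniform base kernel: each single-flip neighbour receives weight \(g(\pi (\gamma ^\prime )/\pi (\gamma ))\), with the posterior ratio computed from log-posterior differences; `log_pi` is an assumed callable returning \(\log \pi (\gamma )\) up to a constant.

```python
import numpy as np

def locally_balanced_weights(gamma, log_pi, g=np.sqrt):
    """Proposal probabilities (9) over the 1-Hamming neighbourhood of gamma,
    with a uniform base kernel q and balancing function g (default sqrt)."""
    p = len(gamma)
    lp_current = log_pi(gamma)
    weights = np.empty(p)
    for j in range(p):
        neighbour = gamma.copy()
        neighbour[j] = 1 - neighbour[j]
        ratio = np.exp(log_pi(neighbour) - lp_current)
        weights[j] = g(ratio) / p          # g(pi(gamma')/pi(gamma)) * q(gamma, gamma')
    return weights / weights.sum()         # divide by Z_g(gamma)

# a flip position is then drawn with these probabilities, e.g.
# j = rng.choice(p, p=locally_balanced_weights(gamma, log_pi))
```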

Theorem 5 of Zanella (2020) shows that using a uniform base kernel on the neighbourhood \(\mathcal {H}_m(\gamma )\) combined with a balancing function g as described above is asymptotically optimal relative to the un-informed or globally balanced proposals, in terms of Peskun ordering, as the dimensionality goes to infinity, under the condition that \(\sup _{\gamma \in \Gamma ,\gamma ^\prime \in N} Z_g(\gamma )/Z_g(\gamma ^\prime ) \rightarrow 1\). However, for a Bayesian variable selection problem, Zhou et al. (2021) argue that the behaviour of the function \(\gamma \mapsto Z_g(\gamma )\) is difficult to predict and the assumption may not hold. They therefore suggest a modified weighting function with upper and lower bounds

$$\begin{aligned} g(t) = \min \{\max \{p^l, t\}, p^L\} \end{aligned}$$
(11)

where p is the total number of regressors and \(-\infty<l< L < \infty \) are some constants. In what follows, the weighting function in (11) is referred to as the thresholding function. The thresholding function is flexible in the sense that it includes globally and locally balanced functions for specific values of l and L.

Their Locally Informed and Thresholded (LIT) algorithm works on neighbourhoods derived from the Add-Delete-Swap scheme and allows the values of l and L to change with the type of move. Under the conditions that the posterior mass concentrates on a small set of models and the chain starts at a model that is not too far from the true data-generating model, they prove that the LIT algorithm achieves a dimension-free mixing rate provided its parameters are properly selected.

3 Random neighbourhood samplers and the ASI algorithm

Let us recall the idea of a neighbourhood sampler from Sect. 1. In general, the neighbourhoods can be random and tailored to the target distribution \(\pi \); we refer to such a scheme as a random neighbourhood sampler. In this section, we present random neighbourhood samplers in detail and show, via Theorem 1, that the ASI sampler is a random neighbourhood sampler.

3.1 Random neighbourhood samplers

We consider a framework for constructing Metropolis-Hastings proposals to sample from \(\pi (\gamma )\) in which a new state is proposed within a random neighbourhood around the current state. The random neighbourhoods are generated using an auxiliary variable k as a neighbourhood indicator. This auxiliary variable k is a discrete random variable defined on a countable set \(\mathcal {K}\) such that the probability of generating a neighbourhood \(N = N(\gamma , k)\) is the same as the probability of generating k (i.e. \(p(N|\gamma ) = p(k|\gamma )\)). Suppose \(\gamma \) is the current state and \(Q_k\) is a Metropolis-Hastings proposal kernel (conditioned on k) with mass function \(q_k\). A new state \(\gamma ^\prime \) is drawn from the kernel \(Q_k\) after a value of k has been generated. In updating k at each iteration, we usually consider proposing a new state \(k^\prime \) conditional on the current state k through a deterministic bijection \(\rho :\mathcal {K} \rightarrow \mathcal {K}\) such that \(k^\prime = \rho (k)\). The mapping \(\rho \) should be an involution, that is a self-inverse function satisfying \(\rho (\rho (k)) = k\). We call an MCMC algorithm that uses the above construction to generate Metropolis-Hastings proposals a random neighbourhood sampler. The following are some examples of random neighbourhood samplers.

Example 1

(Samplers with non-stochastic neighbourhoods)

Samplers with non-stochastic neighbourhoods are also random neighbourhood samplers, in which a specific neighbourhood is generated with probability 1 at each state \(\gamma \). In such cases, the choices of k and \(\rho \) can be arbitrary. For instance, the \(\text {MC}^3\) sampler can be viewed as a random neighbourhood sampler for which the neighbourhood N consists of models that are Hamming distance 1 from \(\gamma \). In particular, the locally balanced samplers of Zanella (2020) also belong to this class, with the neighbourhood N defined as in Sect. 2.3.

Example 2

(Add-Delete-Swap sampler and LIT proposal)

In each iteration of an Add-Delete-Swap (ADS) sampler, a strategy from “addition”, “deletion” and “swap” is chosen uniformly, which implies that the auxiliary variable k is uniformly distributed over the sample space \(\mathcal {K} = \{\text {``addition''}, \text {``deletion''}, \text {``swap''}\}\); a neighbourhood \(N(\gamma , k)\) is then constructed as in Yang et al. (2016). A new state \(\gamma ^\prime \) is uniformly proposed from \(N(\gamma , k)\). The corresponding mapping \(\rho \) sends the auxiliary variable to the opposite strategy, e.g. it sends “addition” to “deletion” and vice versa; the opposite of “swap” is “swap” itself. The Locally Informed and Thresholded (LIT) proposal of Zhou et al. (2021) has an identical neighbourhood construction to an ADS sampler but proposes a new model using an informed proposal with weighting functions bounded above and below.

Example 3

(Hamming ball sampler)

A Hamming ball sampler with radius m is described by Titsias and Yau (2017). The algorithm draws an auxiliary variable (U in their notation, k here) uniformly over the Hamming ball \(\mathcal {H}_m(\gamma ) \subset \Gamma \), the set of states at most Hamming distance m away from \(\gamma \), and then uses the neighbourhood \(N(\gamma , k) = \mathcal {H}_m(k)\) to draw a new state. The Hamming ball sampler proposes a new state according to the posterior model probability truncated to the neighbourhood \(N(\gamma , k)\). In this scheme, the mapping \(\rho \) is the identity function, meaning the same auxiliary variable is used in the reverse move.

The full update of a random neighbourhood sampler uses the three stages below (a schematic code sketch follows the list):

(i) (Neighbourhood construction) Sample a neighbourhood indicator k from \(p(\cdot |\gamma )\), and construct the corresponding neighbourhood \(N(\gamma , k)\);

(ii) (Within-neighbourhood proposal) Propose a new model \(\gamma ^\prime \) in \(N(\gamma , k)\) according to \(Q_k(\gamma , \cdot )\);

(iii) (Accept/reject step) Calculate the probability of the reverse move, \(q_{\rho (k)}(\gamma ^\prime , \gamma )\), by constructing the reverse neighbourhood \(N(\gamma ^\prime , \rho (k))\). Move to the new state \(\gamma ^\prime \) with probability \(\alpha _k(\gamma , \gamma ^\prime )\), where \(\alpha _k(\gamma , \gamma ^\prime )\) is the Metropolis-Hastings acceptance probability

$$\begin{aligned} \alpha _k(\gamma , \gamma ^\prime ) = \min \left\{ 1, \frac{\pi (\gamma ^\prime )p(\rho (k)|\gamma ^\prime )q_{\rho (k)}(\gamma ^\prime , \gamma )}{\pi (\gamma )p(k|\gamma )q_k(\gamma , \gamma ^\prime )}\right\} . \end{aligned}$$
(12)
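Schematically, one full update of a random neighbourhood sampler can be written as the following sketch (our own pseudocode-style Python, in which all of the function arguments are hypothetical stand-ins for the quantities defined above):

```python
import numpy as np

def random_neighbourhood_update(gamma, log_pi, sample_k, log_p_k,
                                propose_within, log_q_k, rho, rng):
    """One transition of a generic random neighbourhood sampler."""
    # (i) neighbourhood construction: draw the indicator k ~ p(. | gamma)
    k = sample_k(gamma, rng)
    # (ii) within-neighbourhood proposal: draw gamma' ~ Q_k(gamma, .)
    gamma_new = propose_within(gamma, k, rng)
    # (iii) accept/reject with probability (12), using the reverse indicator rho(k)
    log_alpha = (log_pi(gamma_new) + log_p_k(rho(k), gamma_new)
                 + log_q_k(rho(k), gamma_new, gamma)
                 - log_pi(gamma) - log_p_k(k, gamma)
                 - log_q_k(k, gamma, gamma_new))
    if np.log(rng.random()) < min(0.0, log_alpha):
        return gamma_new
    return gamma
```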

Throughout this article, we refer to the above three stages as neighbourhood construction, within-neighbourhood proposal and accept/reject step respectively. To preserve the reversibility of the chain, it is desirable to design a neighbourhood generation scheme in which the equivalence

$$\begin{aligned} \gamma ^\prime \in N(\gamma , k) \iff \gamma \in N(\gamma ^\prime , \rho (k)) \end{aligned}$$
(13)

holds for any \(\gamma \), \(\gamma ^\prime \) and k. Given this equivalence, we assume that the condition

$$\begin{aligned} p(k|\gamma )q_k(\gamma , \gamma ^\prime )> 0 \iff p(\rho (k)|\gamma )q_{\rho (k)}(\gamma ^\prime , \gamma ) > 0 \end{aligned}$$
(14)

is satisfied. This assumption is a generalisation of the paired-move strategy in Chen et al. (2016) and it ensures the correctness and reversibility of such a scheme, as shown by the following proposition.

Proposition 1

A random neighbourhood sampler is \(\pi \)-reversible provided that condition (14) holds, \(p(k|\gamma )\) is a valid probability measure on \(\mathcal {K}\) and \(q_k(\gamma , \gamma ^\prime )\) is a valid probability measure on neighbourhood \(N(\gamma , k)\) for all \(\gamma \in \Gamma \) and \(k \in \mathcal {K}\).

Remark 3

To generalise the framework of random neighbourhood samplers, it is possible to use a continuous auxiliary variable k. In such a case, the acceptance probability in (12) should include the Jacobian term.

We show in the next subsection that the ASI sampler is also a random neighbourhood sampler. Unlike the locally balanced proposals, it focuses on constructing sophisticated random neighbourhoods which are more likely to contain promising models, and employs a random walk within-neighbourhood proposal.

3.2 Another take on the ASI scheme

It is not straightforward to see that the ASI sampler is a random neighbourhood sampler; however, we show below that it can indeed be viewed as one. To do so, we introduce a random neighbourhood sampler, the Adaptive Random Neighbourhood (ARN) sampler, and prove that the ARN and ASI samplers are equivalent if they share some common adaptive parameters. The ARN sampler uses a random walk within the neighbourhood but, compared to the locally informed approach, puts more effort into neighbourhood construction.

We consider a random neighbourhood sampler with algorithmic tuning parameter \(\theta = (\xi \eta ^{\text {opt}}, \omega ) \in (\epsilon , 1-\epsilon )^{2p+1} := \Delta _\epsilon ^{2p+1}\) and a small \(\epsilon \in (0, 1/2)\), where \(\eta ^{\text {opt}}\) is given in (4), and the tuning parameters \(\xi \) and \(\omega \) are used in the random neighbourhood construction and the within-neighbourhood proposal respectively. In the random neighbourhood construction, the neighbourhood indicator variable \(k = (k_1, \ldots , k_p) \in \mathcal {K} = \{0,1\}^p\) is generated from the distribution

$$\begin{aligned} p^{\text {RN}}_{\xi \eta ^{\text {opt}}}(k| \gamma ) =&\prod _{j = 1}^p p^{\text {RN}}_{\xi \eta ^{\text {opt}}, j}(k_j| \gamma _j) \end{aligned}$$
(15)

where \(p^{\text {RN}}_{\xi \eta ^{\text {opt}},j}(k_j=1|\gamma _j=0) = \xi A_j^{\text {opt}}\) and \(p^{\text {RN}}_{\xi \eta ^{\text {opt}},j}(k_j=1|\gamma _j=1) = \xi D_j^{\text {opt}}\). This is equivalent to the ASI proposal in (2) with \(k_j=1\) if and only if \(\gamma _j \ne \gamma _j^\prime \). A neighbourhood \(N(\gamma , k)\) is obtained from \(\gamma \) and k, where \(\gamma \) is the “centre” of \(N(\gamma , k)\) and k indicates the positions that may be altered from \(\gamma \). The tuning parameters \(\xi \) and \(\eta ^{\text {opt}}\) are adaptively updated on the fly. For any \(\gamma ^* \in N(\gamma , k)\), \(k_j = 0\) implies that \(\gamma ^*_j = \gamma _j\). This identity can be used to state a formal definition of the neighbourhood \(N(\gamma , k)\) as

$$\begin{aligned} N(\gamma , k) = \{ \gamma ^* \in \Gamma | \gamma _j = \gamma _j^*, ~ \forall k_j = 0 \}. \end{aligned}$$

The neighbourhood contains \(2^{p_k}\) models, where \(p_k\) is the number of 1s in k (i.e. \(p_k := \sum _{j=1}^p k_j\)). The parameter \(\xi \) affects \(p_k\) and therefore controls the neighbourhood size, so we call \(\xi \) the neighbourhood scaling parameter.
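As a small illustration (our own code), the neighbourhood \(N(\gamma , k)\) can be enumerated by letting the flagged positions take all \(2^{p_k}\) combinations of values:

```python
import itertools
import numpy as np

def neighbourhood(gamma, k):
    """Enumerate N(gamma, k): all models that agree with gamma wherever k_j = 0."""
    free = np.flatnonzero(k)                    # positions allowed to change
    models = []
    for bits in itertools.product([0, 1], repeat=len(free)):
        cand = gamma.copy()
        cand[free] = bits
        models.append(cand)
    return models                               # contains 2^{p_k} models
```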

The mapping \(\rho \) is chosen to be the identity function. The within-neighbourhood proposal in this adaptive random neighbourhood scheme is also based on the proposal in (2), restricted to the neighbourhood \(N(\gamma , k)\). It can be characterised as choosing the variables to be added to or deleted from the model by thinning the set \(\{j | ~ k_j = 1 \}\), with thinning probability \(\omega \in (0,1)\). We refer to \(\omega \) as the proposal thinning parameter; it is the only tuning parameter of the within-neighbourhood proposal. A larger value of \(\omega \) increases the probability of proposing \(\gamma ^\prime \) further away from \(\gamma \) in Hamming distance. This can be written formally as the proposal in (2) with tuning parameter \(\eta ^{\text {THIN}} = (A^{\text {THIN}}, D^{\text {THIN}}) = (\omega k,\omega k)\), that is \(A^{\text {THIN}}_j = D^{\text {THIN}}_j = \omega \) for \(k_j = 1\) and \(A^{\text {THIN}}_j = D^{\text {THIN}}_j = 0\) otherwise. The resulting proposal is denoted \(q^{\text {THIN}}_{\omega , k}\) and is given by

$$\begin{aligned} q_{\omega , k}^{\text {THIN}}(\gamma , \gamma ^\prime ) = \prod _{j=1}^p q_{\omega , k_j}^{\text {THIN}}(\gamma _j, \gamma ^\prime _j), \end{aligned}$$
(16)

where \(q_{\omega , 1}^{\text {THIN}}(\gamma _j, 1-\gamma _j) = \omega \) and \(q_{\omega , 0}^{\text {THIN}}(\gamma _j, 1-\gamma _j) = 0\). The proposal \(q_{\omega , k}^{\text {THIN}}\) is symmetric and only generates new states inside the neighbourhood \(N(\gamma , k)\), because the probability of flipping any coordinate j with \(k_j = 0\) is zero. The scheme is completed by accepting or rejecting the proposal using a standard Metropolis-Hastings acceptance probability

$$\begin{aligned} \alpha _{\theta , k}^{\text {ARN}} (\gamma , \gamma ^\prime ) = \min \left\{ 1, \frac{\pi (\gamma ^\prime ) p^{\text {RN}}_{\xi \eta ^{\text {opt}}}(k|\gamma ^\prime ) q^{\text {THIN}}_{\omega , k}(\gamma ^\prime , \gamma ) }{\pi (\gamma ) p^{\text {RN}}_{\xi \eta ^{\text {opt}}}(k|\gamma ) q^{\text {THIN}}_{\omega , k}(\gamma , \gamma ^\prime ) } \right\} . \end{aligned}$$
(17)
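A minimal sketch of one ARN proposal is given below (our own code): the indicator k is drawn as in (15), the within-neighbourhood move thins the flagged positions with probability ω as in (16), and, because \(q^{\text {THIN}}_{\omega , k}\) is symmetric, the acceptance ratio (17) only involves the posterior and \(p^{\text {RN}}\) terms.

```python
import numpy as np

def arn_propose(gamma, A_opt, D_opt, xi, omega, rng):
    """Draw k from (15) and then gamma' from the thinned proposal (16)."""
    prob_k = xi * np.where(gamma == 0, A_opt, D_opt)   # P(k_j = 1 | gamma_j)
    k = (rng.random(len(gamma)) < prob_k).astype(int)
    flips = (rng.random(len(gamma)) < omega) & (k == 1)
    gamma_new = np.where(flips, 1 - gamma, gamma)
    return gamma_new, k

def log_p_rn(k, gamma, A_opt, D_opt, xi):
    """log p^RN(k | gamma) from (15)."""
    prob_k = xi * np.where(gamma == 0, A_opt, D_opt)
    return np.sum(np.where(k == 1, np.log(prob_k), np.log(1.0 - prob_k)))

# acceptance probability (17), with the symmetric q^THIN terms cancelling:
# log_alpha = (log_pi(gamma_new) + log_p_rn(k, gamma_new, A_opt, D_opt, xi)
#              - log_pi(gamma)   - log_p_rn(k, gamma,     A_opt, D_opt, xi))
```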

Remark 4

An alternative formulation to (16) in terms of Hamming distance between \(\gamma \) and \(\gamma ^\prime \) is

$$\begin{aligned} q_{\omega , k}^{\text {THIN}}(\gamma , \gamma ^\prime )&= \omega ^{d_H(\gamma , \gamma ^\prime )} (1-\omega )^{p_k - d_H(\gamma , \gamma ^\prime )} \mathbb {I}\{ \gamma ^\prime \in N(\gamma , k)\} \nonumber \\&= \left( \frac{\omega }{1-\omega }\right) ^{d_H(\gamma , \gamma ^\prime )} (1-\omega )^{p_k} \mathbb {I}\{ \gamma ^\prime \in N(\gamma , k)\} \end{aligned}$$
(18)

where \(d_H(\gamma , \gamma ^\prime )\) is the Hamming distance between the two models \(\gamma \) and \(\gamma ^\prime \).

Remark 5

When \(\omega \) is chosen to be 1/2, the within-neighbourhood proposal \(q^{\text {THIN}}_{\omega = 1/2, k}\) is uniformly distributed over the local neighbourhood \(N(\gamma , k)\).

Algorithm 2 The Adaptive Random Neighbourhood (ARN) proposal

Algorithm 2 describes how a new state \(\gamma ^\prime \) is proposed using the ARN scheme. We indicate the transition kernel by \(p^{\text {ARN}}_{\theta }\) and the corresponding sub-transition kernel conditional on k by \(p^{\text {ARN}}_{\theta , k}\). They obey the relationship

$$\begin{aligned} p^{\text {ARN}}_{\theta }(\gamma , \gamma ^\prime ) = \sum _{k \in \mathcal {K}} p^{\text {ARN}}_{\theta , k}(\gamma , \gamma ^\prime ). \end{aligned}$$

The following proposition helps to show that the ARN sampler is \(\pi \)-reversible.

Proposition 2

For any tuning parameter \(\theta = (\eta , \omega ) \in \Delta _\epsilon ^{2p+1} = (\epsilon , 1-\epsilon )^{2p+1}\), condition (14) holds and the conditional distribution of k within the ARN sampler, \(p^{\text {RN}}_\eta (k|\gamma )\), is a valid probability distribution on \(\mathcal {K} = \{0,1\}^p\). In addition, for any \(\gamma \in \Gamma \) and \(k \in \mathcal {K}\), the within-neighbourhood proposal of the ARN sampler, \(q^{\text {THIN}}_{\omega , k}(\gamma ,\gamma ^\prime )\), is also a valid probability distribution on \(N(\gamma ,k)\).

Proposition 1 together with Proposition 2 show that the ARN transition kernel is \(\pi \)-reversible and therefore generates samples that preserve the target distribution \(\pi \). In fact ARN and ASI are mathematically equivalent provided that the tuning parameter choices are made in a prescribed manner. To see this suppose that the tuning parameters of both the ARN and ASI schemes are fixed and share the same tuning parameter \(\eta \). The following theorem shows that their transition probabilities from \(\gamma \) to \(\gamma ^\prime \) are equal when \(\zeta = \xi \times \omega \) holds.

Theorem 1

Suppose that \(\eta \in \Delta _\epsilon ^{2p}\) and \(\zeta \), \(\xi \), \(\omega \in \Delta _\epsilon \) for small \(\epsilon \in (0, 1/2)\), and let \(p^{\mathrm {ARN}}_{(\xi \eta ,\omega )}\) and \(p^{\mathrm {ASI}}_{\zeta \eta }\) be the transition kernels of the ARN and ASI schemes respectively. If \(\zeta = \xi \times \omega \), then

$$\begin{aligned} p^{\mathrm {ARN}}_{(\xi \eta ,\omega )} (\gamma , \gamma ^\prime ) = p^{\mathrm {ASI}}_{\zeta \eta } (\gamma , \gamma ^\prime ) \end{aligned}$$
(19)

holds for any \(\gamma \) and \(\gamma ^\prime \in \Gamma \).

In addition we deduce the following corollary.

Corollary 1

Setting \(\xi _1 \times \omega _1 = \xi _2 \times \omega _2\) implies

$$\begin{aligned} p^{\mathrm {ARN}}_{(\xi _1\eta , \omega _1)} (\gamma , \gamma ^\prime ) = p^{\mathrm {ARN}}_{(\xi _2\eta , \omega _2)} (\gamma , \gamma ^\prime ) \end{aligned}$$

for any \(\gamma \) and \(\gamma ^\prime \in \Gamma \).

Corollary 1 shows that two ARN kernels with different tuning parameters coincide if the products of the neighbourhood scaling parameter \(\xi \) and the proposal thinning parameter \(\omega \) are equal. The corollary also shows that mass can be shifted between \(\xi \) and \(\omega \) without modifying the resulting transition kernel, as long as their product is preserved.

4 Adaptive random neighbourhood and informed samplers

It should be clear from the above discussion that both the locally informed proposals and ASI schemes can be viewed as random neighbourhood samplers, and that the former focuses on selecting good proposals within a neighbourhood, while the latter focuses on constructing neighbourhoods of models which are more likely to be accepted in the Metropolis-Hastings update. Our main methodological contribution is to design a random neighbourhood sampler for which both the neighbourhood construction and within-neighbourhood proposal are designed in an informed way. We therefore consider using an adaptive random neighbourhood approach to construct neighbourhoods, followed by a locally informed approach to select a proposal from this neighbourhood.

The advantages of combining the two schemes in this manner are worth highlighting. A key strength of ASI is that generating proposals is computationally cheap, but when components of the posterior distribution are highly correlated the assumption of independence embedded into the proposal generation can lead to overly ambitious moves that will be rejected. To combat this, the scaling parameter must be used to control the acceptance rate, but in the presence of high correlation this can lead to small moves and slow mixing. The locally informed sampler can cope well with high levels of correlation in the posterior distribution, but in high dimensions the (un-informed) neighbourhood will often either contain no sensible models or be so large that the cost of computing the posterior probabilities of all models within it becomes prohibitive. Combining the two schemes is therefore an attractive proposition, as an intelligent neighbourhood that is not too large can be constructed using ASI, and correlation can then be accounted for at the second stage by choosing the within-neighbourhood proposal using the locally informed approach.

We give the details of this adaptive random neighbourhood and informed sampler below, which we call the Adaptive Random Neighbourhood Informed (ARNI) sampler. After this we define the point-wise ARNI (PARNI) scheme, which enjoys the benefits of ARNI but with much lower computational cost.

4.1 Adaptive random neighbourhood informed algorithm

We first describe a general construction of random neighbourhood informed proposals. Suppose a random neighbourhood sampler is given with neighbourhood indicator variable \(k \in \mathcal {K}\), an update mapping \(\rho \) and a within-neighbourhood proposal kernel \(Q_k\). The variable k follows a conditional distribution \(p(k|\gamma )\), whereas the proposal \(Q_k\) produces a new state \(\gamma ^\prime \) within the neighbourhood \(N(\gamma , k)\) in an uninformed manner. We consider a class of random neighbourhood informed proposals \(Q_{g,k}\) with mass function

$$\begin{aligned} q_{g,k}(\gamma , \gamma ^\prime ) = {\left\{ \begin{array}{ll} \frac{g\left( \frac{\pi (\gamma ^\prime )p(\rho (k)|\gamma ^\prime )q_{\rho (k)}(\gamma ^\prime , \gamma )}{\pi (\gamma )p(k|\gamma )q_k(\gamma , \gamma ^\prime )}\right) q_k(\gamma , \gamma ^\prime )}{Z_{g, k}(\gamma )}, &{} \gamma ^\prime \in N(\gamma , k) \\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(20)

where \(g:[0,\infty )\rightarrow [0,\infty )\) is a continuous monotone weighting function, and \(Z_{g, k}(\gamma )\) is a normalising constant defined by

$$\begin{aligned}&Z_{g, k}(\gamma )\nonumber \\&\quad = \sum _{\gamma ^*\in N(\gamma , k)} g\left( \frac{\pi (\gamma ^*)p(\rho (k)|\gamma ^*)q_{\rho (k)}(\gamma ^*, \gamma )}{\pi (\gamma )p(k|\gamma )q_k(\gamma , \gamma ^*)}\right) q_k(\gamma , \gamma ^*). \end{aligned}$$
(21)

The generated new state \(\gamma ^\prime \) is accepted using the Metropolis-Hastings rule

$$\begin{aligned} \alpha _{g, k}(\gamma , \gamma ^\prime )&= \min \left\{ 1, \frac{\pi (\gamma ^\prime )p(\rho (k)|\gamma ^\prime )q_{g,\rho (k)}(\gamma ^\prime , \gamma )}{\pi (\gamma )p(k|\gamma )q_{g,k}(\gamma , \gamma ^\prime )}\right\} . \end{aligned}$$
(22)

The proposal collapses to the locally balanced proposal of Zanella (2020) when the neighbourhood is non-stochastic, the weighting function g is a balancing function that satisfies \(g(t) = tg(1/t)\) and the within-neighbourhood proposal is symmetric. In what follows, we combine the above random neighbourhood informed proposal with the ARN scheme and develop an Adaptive Random Neighbourhood Informed (ARNI) proposal that uses an informed proposal at the within-neighbourhood stage. In the ARNI scheme, the mapping \(\rho \) is chosen to be the identity function (i.e. \(\rho (k) = k\)) and the within-neighbourhood proposal in Algorithm 2 is replaced by

$$\begin{aligned} q^{\text {ARNI}}_{\theta , k}(\gamma , \gamma ^\prime )&\propto g\left( \frac{\pi (\gamma ^\prime )p^{\text {RN}}_{\xi \eta ^{\text {opt}}}(k|\gamma ^\prime )q^{\text {THIN}}_{\omega , k}(\gamma ^\prime , \gamma )}{\pi (\gamma )p^{\text {RN}}_{\xi \eta ^{\text {opt}}}(k|\gamma )q^{\text {THIN}}_{\omega , k}(\gamma , \gamma ^\prime )}\right) q^{\text {THIN}}_{\omega , k}(\gamma , \gamma ^\prime ) \nonumber \\&= g\left( \frac{\pi (\gamma ^\prime )p^{\text {RN}}_{\xi \eta ^{\text {opt}}}(k|\gamma ^\prime )}{\pi (\gamma )p^{\text {RN}}_{\xi \eta ^{\text {opt}}}(k|\gamma )}\right) q^{\text {THIN}}_{\omega , k}(\gamma , \gamma ^\prime ) \end{aligned}$$
(23)

for some weighting function g and some parameters \(\theta = (\xi \eta ^{\text {opt}}, \omega ) \in \Delta _\epsilon ^{2p+1} = (\epsilon ,1-\epsilon )^{2p+1}\). The last equation follows since the within-neighbourhood proposal \(q^{\text {THIN}}_{\omega , k}\) is symmetric and therefore \(q^{\text {THIN}}_{\omega , k}(\gamma ^\prime , \gamma )/q^{\text {THIN}}_{\omega , k}(\gamma , \gamma ^\prime ) = 1\) for all \(\gamma ^\prime \in N(\gamma , k)\). The Metropolis-Hastings acceptance probability is tailored to the new informed proposal as

$$\begin{aligned} \alpha _{\theta , k}^{\text {ARNI}}(\gamma , \gamma ^\prime ) = \min \left\{ 1, \frac{\pi (\gamma ^\prime )p^{\text {RN}}_{\xi \eta ^{\text {opt}}}(k|\gamma ^\prime )q^{\text {ARNI}}_{\theta , k}(\gamma ^\prime , \gamma )}{\pi (\gamma )p^{\text {RN}}_{\xi \eta ^{\text {opt}}}(k|\gamma )q^{\text {ARNI}}_{\theta , k}(\gamma , \gamma ^\prime )}\right\} . \end{aligned}$$
(24)

The optimal choice of informed weighting function is unclear in the ARNI scheme. The thresholding function is not appropriate since the neighbourhoods generated by ARNI cannot be divided into addition and deletion neighbourhoods as in the LIT scheme. We therefore recommend using a balancing function, which satisfies \(g(t) = tg(1/t)\), to form an ARNI balanced proposal.
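A sketch of the within-neighbourhood proposal (23) with a balancing function is given below (our own code); it enumerates all \(2^{p_k}\) models in \(N(\gamma , k)\), which is only feasible when \(p_k\) is small, and uses Hastings' choice \(g(t) = \min \{1,t\}\) by default. Here `log_pi` and `log_p_rn` are assumed callables returning \(\log \pi (\gamma )\) and \(\log p^{\text {RN}}_{\xi \eta ^{\text {opt}}}(k|\gamma )\).

```python
import itertools
import numpy as np

def arni_within_proposal(gamma, k, omega, log_pi, log_p_rn,
                         g=lambda t: np.minimum(1.0, t)):
    """Proposal probabilities (23) over N(gamma, k): balancing function g
    applied to the target ratio, times the thinned base kernel (16)."""
    free = np.flatnonzero(k)
    p_k = len(free)
    base = log_pi(gamma) + log_p_rn(k, gamma)
    models, weights = [], []
    for bits in itertools.product([0, 1], repeat=p_k):
        cand = gamma.copy()
        cand[free] = bits
        d = int(np.sum(cand != gamma))                        # Hamming distance
        log_ratio = log_pi(cand) + log_p_rn(k, cand) - base   # q^THIN ratio cancels
        q_thin = omega ** d * (1.0 - omega) ** (p_k - d)
        models.append(cand)
        weights.append(g(np.exp(log_ratio)) * q_thin)
    weights = np.asarray(weights, dtype=float)
    return models, weights / weights.sum()                    # divide by Z_{g,k}(gamma)
```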

To boost the convergence of these adaptive tuning parameters, the same multiple-chain strategy as in ASI should be implemented. In addition to the notation used for ASI, \(k^{l,(i)}\) denotes the neighbourhood indicator variable for the l-th chain at iteration i. For L multiple chains, the tuning parameters \(\eta ^{\text {opt}}\) are updated following the same scheme as ASI in (5) and (6). The two scaling parameters \(\xi \) and \(\omega \) can be updated using the Robbins-Monro schemes

$$\begin{aligned} \text {logit}_\epsilon \xi ^{(i+1)}&= \text {logit}_\epsilon \xi ^{(i)} + \frac{\phi _i}{L} \sum _{l=1}^L (p_{k^{l,(i)}} - s) \end{aligned}$$
(25)
$$\begin{aligned} \text {logit}_\epsilon \omega ^{(i+1)}&= \text {logit}_\epsilon \omega ^{(i)} + \frac{\phi _i}{L} \sum _{l=1}^L (\alpha ^l_i - \tau ) \end{aligned}$$
(26)

where \(p_k\) is the size of k as mentioned previously, s is the target size of k, \(\alpha ^l_i\) is the acceptance probability at the ith iteration for the l-th chain and \(\tau \) is the target average acceptance rate.

Remark 6

For practical convenience, it is often useful to choose the diminishing sequence \(\phi _i\) of the form \(\phi _i = i^{-\lambda }\) for \(\lambda \in (1/2, 1)\), since the condition \(\phi _i = \mathcal {O}(i^{-\lambda })\) is not violated by this choice. Choosing \(\lambda > 1\) would result in finite adaptation (Roberts and Rosenthal 2007), in which the adaptation stops after a finite stopping time, and using \(\lambda < 1/2\) is uncommon because of finite-sample stability concerns. We therefore recommend using \(\phi _i = i^{-0.7}\) for both updating schemes. See Remark 3 in Griffin et al. (2021) for further discussion.

While the informed proposal is powerful in accelerating the convergence of the chains, it also introduces extra computational costs since the posterior probabilities of all models in a neighbourhood are required. Given a k of size \(p_k\), the resulting neighbourhood \(N(\gamma , k)\) consists of \(2^{p_k}\) models. Although it is possible to speed up the posterior calculations using Gray codes as introduced in George and McCulloch (1997), evaluating \(2^{p_k}\) models is still computationally expensive when \(p_k\) is very large and leads to an inefficient scheme. One way to address the issue is to tune the neighbourhood scaling parameter \(\xi \) to generate neighbourhoods of a desired size, say \(s = 5\). In our experience, such control of the size of k comes at the cost of reduced exploration of the model space, and the ARNI scheme then fails to achieve better performance than ASI. This motivated us to develop a more efficient implementation of this approach that controls computational cost but maintains good exploration properties.

4.2 The PARNI sampler

We consider a point-wise implementation of the ARNI scheme (for short, the PARNI scheme). This approach is motivated by the block-wise implementation in Zanella (2020) and the block design strategy in Titsias and Yau (2017). The main idea is that a large neighbourhood is divided into a series of smaller blocks and the new model is proposed by sequentially adding or deleting variables in each block. The block design can lead to a significant reduction in the total number of models considered and so requires less computational effort. For instance, suppose that there are \(p_k\) non-zero neighbourhood indicator variables, which are divided into \(p_k/m\) blocks each containing m variables. The neighbourhood generated by each block then contains \(2^m\) models, so working through every block to propose a new state requires evaluating \(2^m p_k/m\) posterior probabilities. As the computational cost is proportional to the total number of models considered, the cost is largest when \(m = p_k\), where the only block is the entire neighbourhood \(N(\gamma , k)\) and \(2^{p_k}\) models must be evaluated. In contrast, the smallest cost occurs when \(m=1\), where each block contains one variable and therefore only two models, giving \(2p_k\) evaluations in total (for example, with \(p_k = 12\) this is 24 models rather than 4096). Throughout the section, we consider the latter block design with \(m=1\); the resulting algorithm is the PARNI sampler.

4.2.1 Main algorithm

We now formally present the PARNI algorithm and show how a new model \(\gamma ^\prime \) is proposed from the current model \(\gamma \). We use the same random neighbourhood construction as the ARNI scheme; in addition, the neighbourhood scaling parameter \(\xi \) is fixed at 1, so that neighbourhood sizes are not reduced at this stage. In other words, the neighbourhoods are generated with the optimal values \(\eta ^{\text {opt}}\) as in (4). After a neighbourhood \(N(\gamma , k)\) is sampled, we sequentially propose new models inside \(N(\gamma , k)\), each differing from the previous one by a Hamming distance of at most 1. We define \(K = \{K_1, \ldots , K_{p_k}\} = \{j|k_j = 1\}\) to be the set of variables for which \(k_j=1\) (the order of the variables is random). We also define a sequence of models, \(\gamma (1), \ldots , \gamma (p_k)\), and neighbourhoods, \(N(1),\ldots , N(p_k)\), used to sample the final proposal \(\gamma ^\prime \). To introduce more flexibility, we allow different weighting functions for each sub-proposal, so \(p_k\) weighting functions \(g_1, \ldots , g_{p_k}\) are defined. Finally, let \(e(1),\ldots ,e(p)\) be the basis vectors of a p-dimensional Cartesian space, where \(e(j)_j = 1\) and \(e(j)_{j^\prime } = 0\) whenever \(j^\prime \ne j\). We consider the neighbourhoods constructed according to \(\gamma (r-1)\) and \(e(K_r)\) for r from 1 to \(p_k\). The first neighbourhood is \(N(1)=N(\gamma , e(K_1))\), from which we propose a model \(\gamma (1)\) according to

$$\begin{aligned} q^{\text {PARNI}}_{\theta ,K_1}(\gamma , \gamma (1)) \propto {\left\{ \begin{array}{ll} g_1\left( \frac{\pi (\gamma (1))p^{\text {RN}}_{\eta ^{\text {opt}}}(e(K_1)|\gamma (1))}{\pi (\gamma )p^{\text {RN}}_{\eta ^{\text {opt}}}(e(K_1)|\gamma )}\right) q^{\text {THIN}}_{\omega , e(K_1)}(\gamma , \gamma (1)), \quad &{} \text { if }\gamma (1) \in N(1) \\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(27)

for some algorithmic parameters \(\theta = (\eta ^{\text {opt}}, \omega ) \in \Delta _\epsilon ^{2p+1}\). We repeat this process to construct the second neighbourhood \(N(2) = N(\gamma (1), e(K_2))\) and propose the model \(\gamma (2)\) from N(2). In general, at step r, we define \(N(r) = N(\gamma (r-1), e(K_r))\) and propose a model \(\gamma (r)\) from

$$\begin{aligned} q^{\text {PARNI}}_{\theta ,K_r}(\gamma (r-1), \gamma (r)) \propto {\left\{ \begin{array}{ll} g_r\left( \frac{\pi (\gamma (r))p^{\text {RN}}_{\eta ^{\text {opt}}}(e(K_r)|\gamma (r))}{\pi (\gamma (r-1))p^{\text {RN}}_{\eta ^{\text {opt}}}(e(K_r)|\gamma (r-1))}\right) q^{\text {THIN}}_{\omega , e(K_r)}(\gamma (r-1), \gamma (r)), \quad &{} \text { if }\gamma (r) \in N(r) \\ 0, &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(28)

Each sub-proposal above only allows the value in position \(K_r\) to change. Figure 1 provides a flowchart of the PARNI scheme, which only involves enumerating at most \(2p_k\) models rather than the \(2^{p_k}\) models of the ARNI proposal. The parameters of the proposal are \(\theta = (\eta ^{\text {opt}}, \omega )\).
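The forward pass of the PARNI proposal can be sketched as follows (our own code, with \(\xi = 1\) as described above); `log_pi` is a hypothetical callable returning \(\log \pi (\gamma )\), and the \(p^{\text {RN}}\) factors at positions other than \(K_r\) cancel in the ratio because consecutive models agree there.

```python
import numpy as np

def parni_propose(gamma, k, A_opt, D_opt, omega, log_pi, rng,
                  g=lambda t: np.minimum(1.0, t)):
    """Sequential point-wise proposal (27)-(28); returns the proposed model
    and the log of the forward proposal probability (30)."""
    K = rng.permutation(np.flatnonzero(k))     # random ordering of flagged positions
    cur = gamma.copy()
    log_q_forward = 0.0
    for j in K:
        flip = cur.copy()
        flip[j] = 1 - flip[j]
        # ratio pi(flip) p^RN(e(j)|flip) / (pi(cur) p^RN(e(j)|cur)); only the
        # factor at position j survives since cur and flip agree elsewhere
        log_prn_cur = np.log(A_opt[j] if cur[j] == 0 else D_opt[j])
        log_prn_flip = np.log(A_opt[j] if flip[j] == 0 else D_opt[j])
        log_ratio = log_pi(flip) + log_prn_flip - log_pi(cur) - log_prn_cur
        w_stay = g(1.0) * (1.0 - omega)        # q^THIN keeps position j unchanged
        w_flip = g(np.exp(log_ratio)) * omega  # q^THIN flips position j
        prob_flip = w_flip / (w_stay + w_flip)
        if rng.random() < prob_flip:
            cur, step_log_prob = flip, np.log(prob_flip)
        else:
            step_log_prob = np.log(1.0 - prob_flip)
        log_q_forward += step_log_prob
    return cur, log_q_forward
```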

To construct a \(\pi \)-reversible chain, the probability of the reverse move is required. The reverse move uses \(K^\prime = \rho (K)\) as its auxiliary variable, where the mapping \(\rho \) reverses the order of the elements in K, so that \(K^\prime \) contains the same elements as K but in reverse order. The first benefit of this choice is that it leads to identical intermediate models in the forward and reverse proposals, so that the posterior probabilities of \(p_k\) models are required instead of \(2 p_k\). Suppose that \(\gamma ^\prime (r)\) for \(r = 0 ,\ldots , p_k\) are the consecutive intermediate models used in the reverse move and \(N^\prime (r)\) for \(r = 1 ,\ldots , p_k\) are the neighbourhoods used in the reverse move. These models and neighbourhoods are identical to those used in the forward move but in the opposite order; in particular, \(\gamma ^{\prime }(r) = \gamma (p_k-r)\) for \(r = 0 ,\ldots , p_k\) and \(N^\prime (r) = N(p_k-r+1)\) for \(r = 1 ,\ldots , p_k\). The second benefit is that the design leads to a simpler form of the Metropolis-Hastings acceptance probability. Let Z(r) be the normalising constant of the r-th sub-proposal, \(q^{\text {PARNI}}_{\theta ,K_r}(\gamma (r-1), \gamma (r))\), and \(Z^\prime (r)\) the normalising constant of the r-th sub-proposal in the reverse move, \(q^{\text {PARNI}}_{\theta ,K^\prime _r}(\gamma ^{\prime }(r-1), \gamma ^{\prime }(r))\), with weighting functions \(g^\prime _r\). We have that

Fig. 1 Flowcharts of the point-wise implementation of the adaptive random neighbourhood informed proposal in one iteration. Top panel: proposed direction. Bottom panel: reversed direction. The black neighbourhoods \(N(\gamma , k)\) and \(N(\gamma ^\prime , k)\) are the original large neighbourhoods. The red neighbourhoods N(r) and \(N^\prime (r)\) are the subsequent small neighbourhoods used for each intermediate proposal. The orange model \(\gamma \) is the current state and the cerise model \(\gamma ^\prime \) is the final proposal. The blue models \(\gamma (r)\) and \(\gamma ^{\prime }(r)\) are intermediate models. The light blue arrows indicate the position-wise proposals. (Color figure online)

$$\begin{aligned} Z(r) =&\sum _{\gamma ^* \in N(r)} g_r\left( \frac{\pi (\gamma ^*)p^{\text {RN}}_{\eta ^{\text {opt}}}(e(K_r)|\gamma ^*)}{\pi (\gamma (r-1))p^{\text {RN}}_{\eta ^{\text {opt}}}(e(K_r)|\gamma (r-1))}\right) \\&\quad q^{\text {THIN}}_{\omega , e(K_r)}(\gamma (r-1), \gamma ^*) \\ Z^\prime (r) =&\sum _{\gamma ^* \in N^\prime (r)} g^\prime _r\left( \frac{\pi (\gamma ^*)p^{\text {RN}}_{\eta ^{\text {opt}}}(e(K^\prime _r)|\gamma ^*)}{\pi (\gamma ^\prime (r-1))p^{\text {RN}}_{\eta ^{\text {opt}}}(e(K^\prime _r)|\gamma ^\prime (r-1))}\right) \\&\quad q^{\text {THIN}}_{\omega , e(K^\prime _r)}(\gamma ^\prime (r-1), \gamma ^*). \end{aligned}$$
(29)

We let \(q^{\text {PARNI}}_{\theta , k}(\gamma , \gamma ^\prime )\) denote the full proposal kernel, which satisfies

$$\begin{aligned} q^{\text {PARNI}}_{\theta , k}(\gamma , \gamma ^\prime ) = \prod _{r=1}^{p_k} q^{\text {PARNI}}_{\theta , K_r}(\gamma (r-1), \gamma (r)) \end{aligned}$$
(30)

where \(\gamma (0)\) is current state \(\gamma \) and \(\gamma (p_k)\) is the final proposal \(\gamma ^\prime \). The Metropolis-Hastings acceptance probability of the PARNI proposal is given as

$$\begin{aligned} \alpha ^{\text {PARNI}}_{\theta , k}(\gamma , \gamma ^\prime ) = \min \left\{ 1, \frac{\pi (\gamma ^\prime )p^{\text {RN}}_{\eta ^{\text {opt}}}(k|\gamma ^\prime )q^{\text {PARNI}}_{\theta , k}(\gamma ^\prime , \gamma )}{\pi (\gamma )p^{\text {RN}}_{\eta ^{\text {opt}}}(k|\gamma )q^{\text {PARNI}}_{\theta , k}(\gamma , \gamma ^\prime )} \right\} . \end{aligned}$$
(31)

In specifying the weighting functions \(g_r\) for \(r = 1, \ldots , p_{k}\), note that each sub-proposal in the PARNI scheme can be treated as an addition/deletion move, so it is feasible to choose thresholding functions, as in the LIT scheme of Zhou et al. (2021), that depend on the type of move. We consider the following thresholded weighting function

$$\begin{aligned} g_r(t) = {\left\{ \begin{array}{ll} \min \{\max \{p^{-1}, t\}, p\}, &{}\text { if }\gamma (r)_{K_r} = 0\\ \min \{\max \{p^{-1}, t\}, 1\}, &{}\text { if }\gamma (r)_{K_r} = 1 \end{array}\right. } \end{aligned}$$
(32)

for \(r = 1, \ldots , p_{k}\). The weighting functions in the reverse move are defined similarly. Alternatively, we can also use a balancing function g in PARNI. The choice of balancing function mainly focuses on three particular candidates: the square root function \(g_{\text {sq}} (t) = \sqrt{t}\), Hastings’ choice \(g_{\text {H}}(t)=\min \{1,t\}\) and Barker’s choice \(g_{\text {B}}(t) = t/(1+t)\). The comparisons of these balancing functions in Supplement B.1.3 of Zanella (2020) illustrate two major findings: Hastings’ and Barker’s choices differ by at most a factor of 2 owing to their similar asymptotic behaviour, while the square root function mixes the worst outside the burn-in phase. We therefore use Hastings’ choice throughout the rest of the paper, that is,

$$\begin{aligned} g_r(t) = \min \{1,t\} \end{aligned}$$
(33)

for all \(r = 1, \ldots , p_{k}\). Similar results are also expected for Barker’s choice. Using a balancing function leads to a simpler form of the Metropolis-Hastings acceptance probability, as the following proposition illustrates:

Proposition 3

Suppose \(\gamma \), \(\gamma ^\prime \in \Gamma \) are fixed. For any \(\theta = (\eta , \omega ) \in \Delta _\epsilon ^{2p+1}\) and k such that \(\gamma ^\prime \in N(\gamma , k)\), if the weighting function \(g_r\) satisfies \(g_r(t)=tg_r(1/t)\) for all r, then the Metropolis-Hastings acceptance probability in (31) can be written as

$$\begin{aligned} \alpha ^{\text {PARNI}}_{\theta , k}(\gamma , \gamma ^\prime ) = \min \left\{ 1, \prod _{r=1}^{p_k} \frac{Z(r)}{Z^\prime (r)} \right\} \end{aligned}$$
(34)

where Z(r), \(Z^\prime (r)\) for \(r = 1, \ldots , p_k\) are the normalising constants given in (29).
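Proposition 3 means that, once the forward and reverse normalising constants are available, the accept/reject step reduces to a single product of ratios. A minimal sketch of this computation, assuming the Z(r) and \(Z^\prime (r)\) values have already been computed as in (29) (the function name is ours):

```python
import math

def parni_acceptance_prob(Z, Z_rev):
    """Acceptance probability (34): min{1, prod_r Z(r) / Z'(r)}.

    Z and Z_rev are sequences holding Z(r) and Z'(r) for r = 1, ..., p_k.
    The product is accumulated on the log scale to avoid numerical
    under/overflow when p_k is large.
    """
    log_ratio = sum(math.log(z) - math.log(z_rev) for z, z_rev in zip(Z, Z_rev))
    return min(1.0, math.exp(log_ratio))
```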

The PARNI proposal that uses the thresholding function is referred to as PARNIT, whereas the one that uses the balancing function is referred to as PARNIB.
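For concreteness, the two weighting choices can be written as short functions. The sketch below takes (32) and (33) at face value; the argument t is the posterior ratio appearing in (29), `gamma_r_Kr` is the value of the \(K_r\)-th coordinate of the model being weighted, and the function names are ours.

```python
def g_thresholded(t, gamma_r_Kr, p):
    """Thresholded weighting function (32), used by PARNIT.

    Following the equation literally: the value is floored at 1/p in both
    branches, and capped at p when the K_r-th coordinate equals 0 and at 1
    when it equals 1.
    """
    cap = float(p) if gamma_r_Kr == 0 else 1.0
    return min(max(1.0 / p, t), cap)


def g_hastings(t):
    """Hastings' balancing function (33), used by PARNIB: g(t) = min{1, t}."""
    return min(1.0, t)


def g_barker(t):
    """Barker's balancing function: g(t) = t / (1 + t)."""
    return t / (1.0 + t)
```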

4.2.2 Adaptation schemes for algorithmic parameters

The last building block needed to complete the PARNI sampler is the adaptation mechanism for the tuning parameters. The posterior inclusion probabilities \(\pi _j\) are updated as in the ASI scheme in (5). The magnitude of the proposal thinning parameter \(\omega \) is crucial to the mixing time and convergence rate of the chains. We therefore consider two adaptation schemes for updating \(\omega \): the Robbins-Monro adaptation scheme (RM) and the Kiefer–Wolfowitz adaptation scheme (KW). For the rest of this section, we assume that L multiple chains are used for the PARNI sampler.

The Robbins-Monro adaptation scheme is widely used for updating tuning parameters of adaptive MCMC algorithms. Andrieu and Thoms (2008) review several adaptive MCMC algorithms that use variants of the Robbins-Monro process. Given a target probability of acceptance \(\tau \), the Robbins-Monro adaptation scheme automatically adjusts \(\omega \) by comparing the current probability of acceptance with \(\tau \). It is generally considered to be a robust adaptation scheme. Given the acceptance probability of the l-th chain at the i-th iteration, \(\alpha ^l_i\), the tuning parameter \(\omega \) is updated through the law

$$\begin{aligned} \text {logit}_\epsilon \omega ^{(i+1)} = \text {logit}_\epsilon \omega ^{(i)} + \frac{\phi _i}{L}\sum _{l=1}^L(\alpha ^l_i - \tau ). \end{aligned}$$
(35)

Here \(\phi _i = O(i^{-\lambda })\) for some constant \(1/2< \lambda < 1\). The theoretically optimal value of \(\tau \) may not exist for every candidate proposal kernel and choice of posterior distribution. Based on a large number of experiments, illustrated in Sect. 5, we recommend using the diminishing sequence \(\phi _i = i^{-0.7}\) and a target acceptance rate of \(\tau = 0.65\).
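A minimal sketch of the update (35) follows, assuming the constrained logit \(\text {logit}_\epsilon \) maps \((\epsilon , 1-\epsilon )\) to the real line via \(\omega \mapsto \log \{(\omega -\epsilon )/(1-\epsilon -\omega )\}\); this transformation, the helper names and the numerical values are our assumptions rather than the paper's exact implementation.

```python
import math

EPS = 0.001    # assumed value of the epsilon defining the interval (eps, 1 - eps)
TAU = 0.65     # target acceptance rate recommended in the text

def logit_eps(w, eps=EPS):
    # One natural choice of constrained logit mapping (eps, 1 - eps) onto the reals.
    return math.log((w - eps) / (1.0 - eps - w))

def inv_logit_eps(x, eps=EPS):
    # Inverse mapping back into (eps, 1 - eps).
    return eps + (1.0 - 2.0 * eps) / (1.0 + math.exp(-x))

def rm_update(omega, accept_probs, i, lam=0.7, tau=TAU):
    """Robbins-Monro update (35): nudge omega towards the target acceptance rate tau.

    accept_probs holds the acceptance probabilities alpha^l_i of the L chains at
    iteration i (i >= 1); phi_i = i^{-lam} is the diminishing step size.
    """
    phi_i = i ** (-lam)
    L = len(accept_probs)
    x = logit_eps(omega) + (phi_i / L) * sum(a - tau for a in accept_probs)
    return inv_logit_eps(x)
```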

Apart from the Robbins-Monro scheme above, the Kiefer–Wolfowitz scheme is another possible adaptation for tuning \(\omega \) in the PARNI sampler. The Kiefer–Wolfowitz scheme is a stochastic approximation algorithm that modifies the Robbins-Monro scheme by replacing the derivative with a finite difference approximation. In this scheme the tuning parameter is updated to target the optimiser of an objective function of interest. Following Pasarica and Gelman (2010), one can use the expected squared jumping distance as the objective function because it is closely linked to the mixing and convergence properties of a Markov chain. The expected squared jumping distance can be estimated by the average squared jumping distance. An alternative objective function is the generalised speed measure introduced in Titsias and Dellaportas (2019).

To construct the finite difference approximation to the derivative of the average squared jumping distance, we exploit the multiple chain implementation of PARNI. The multiple independent chains naturally provide independent samples, which suits the Kiefer–Wolfowitz approximation. Our implementation of the Kiefer–Wolfowitz adaptation scheme proceeds as follows. We first divide the L chains into two equally sized batches, \(L^+\) and \(L^-\). Let \(c_i\) be a diminishing sequence; new proposals are generated using \(\omega ^+ = \omega ^{(i)} + c_i\) for chains in \(L^+\) and \(\omega ^- = \omega ^{(i)} - c_i\) for chains in \(L^-\). The average squared jumping distances for these batches (i.e. \(\text {ASJD}^{+,(i)}\) and \(\text {ASJD}^{-,(i)}\)) are estimated using the new proposals and their corresponding probabilities of acceptance. The tuning parameter \(\omega \) is then updated according to the rule

$$\begin{aligned} \text {logit}_\epsilon \omega ^{(i+1)} = \text {logit}_\epsilon \omega ^{(i)} + a_i \left( \frac{\text {ASJD}^{+,(i)} - \text {ASJD}^{-,(i)}}{2c_i}\right) . \end{aligned}$$
(36)

We suggest using \(a_i = i^{-1}\) and \(c_i = i^{-0.5}\) in the Kiefer–Wolfowitz scheme. Further details of the Kiefer–Wolfowitz adaptation scheme are given in Section A.1 of the supplementary material, and a feasibility analysis of the scheme is carried out in Section C.2 of the supplementary material.
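The corresponding update (36) can be sketched in the same way, reusing `logit_eps` and `inv_logit_eps` from the Robbins-Monro sketch above; how exactly the perturbed values \(\omega \pm c_i\) and the batch ASJD estimates are formed is deferred to the supplementary material, so the snippet only illustrates the update rule itself.

```python
def kw_update(omega, asjd_plus, asjd_minus, i):
    """Kiefer-Wolfowitz update (36) with the suggested sequences a_i = i^{-1}, c_i = i^{-0.5}.

    asjd_plus and asjd_minus are the average squared jumping distances of the
    chains run with omega + c_i and omega - c_i respectively (i >= 1).
    """
    a_i = 1.0 / i
    c_i = i ** (-0.5)
    grad_est = (asjd_plus - asjd_minus) / (2.0 * c_i)  # finite-difference gradient estimate
    return inv_logit_eps(logit_eps(omega) + a_i * grad_est)
```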

Remark 7

Blum (1954) shows that the Kiefer–Wolfowitz scheme converges if the diminishing sequences \(a_i\) and \(c_i\) satisfy \(\sum _{i=0}^\infty a_i^2 c_i^{-2} = \infty \). According to Remark 6, the sequences \(a_i\) and \(c_i\) should have diminishing rates between \(-0.5\) and \(-1\). Therefore, the only possible pair is \(a_i = i^{-1}\) and \(c_i = i^{-0.5}\).

Remark 8

As an alternative to adapting the thinning parameter \(\omega \) through the above adaptation schemes, one can simply fix \(\omega \) at 1/2, in which case the base kernel \(q^{\text {THIN}}\) becomes uniform. Note that fixing \(\omega \) at 1/2 does not necessarily lead to optimal mixing for the PARNI scheme.

Pseudocode for the PARNI samplers is given in Algorithm 3. The corresponding transition kernel is referred to as \(p_{\theta }^{\text {PARNI}(*)-\bullet }\) for \(* = \text {T}\) or \(\text {B}\) and \(\bullet = \text {RM}\) or \(\text {KW}\). In the next section we show that the PARNI sampler is \(\pi \)-ergodic and satisfies a strong law of large numbers.

Algorithm 3 (pseudocode of the PARNI samplers)

4.2.3 Ergodicity and strong law of large numbers

The multiple chain acceleration can be thought of as the realisation of L runs on the product space \(\Gamma ^{\otimes L}\) with joint variable \(\gamma ^{\otimes L} = (\gamma ^{1}, \ldots , \gamma ^{L}) \in \Gamma ^{\otimes L}\). Without loss of generality, suppose further that \(L \ge 1\) for the Robbins-Monro adaptation scheme and \(L \ge 2\) for the Kiefer–Wolfowitz adaptation scheme. We consider a posterior distribution \(\pi \) on the space \(\Gamma \) of the form

$$\begin{aligned} \pi (\gamma ) \propto p(y|\gamma ) p(\gamma ) \end{aligned}$$
(37)

where both \(p(y|\gamma )\) and \(p(\gamma )\) are analytically available. In addition, the joint posterior distribution \(\pi ^{\otimes L}\) on the product set \(\Gamma ^{\otimes L}\) is given as

$$\begin{aligned} \pi ^{\otimes L}(\gamma ^{\otimes L}) = \prod _{l=1}^L \pi (\gamma ^{l}). \end{aligned}$$
(38)

In this section, the symbol \(*\) denotes either T or B and the symbol \(\bullet \) represents either KW or RM. The sub-proposal mass function of the \(\text {PARNI}(*)-\bullet \) sampler given neighbourhood indicator variable k and tuning parameter \(\theta = (\eta , \omega )\) is defined by

$$\begin{aligned} \psi ^{\text {PARNI}(*)-\bullet }_{\theta , k}(\gamma , \gamma ^\prime )= p^{\text {RN}}_\eta (k|\gamma )q^{\text {PARNI}(*)-\bullet }_{\theta , k}(\gamma , \gamma ^\prime ). \end{aligned}$$
(39)

The full transition kernel of the PARNI sampler is obtained by marginalising over all possible k,

$$\begin{aligned} P^{\text {PARNI}(*)-\bullet }_{\theta }(\gamma , S) = \sum _{k \in \mathcal {K}} P^{\text {PARNI}(*)-\bullet }_{\theta , k}(\gamma , S) \end{aligned}$$
(40)

where the sub-transition kernels given k are

$$\begin{aligned} P^{\text {PARNI}(*)-\bullet }_{(\theta , k)}(\gamma , S)&= \sum _{\gamma ^\prime \in S} p^{\text {PARNI}(*)-\bullet }_{\theta , k}(\gamma , \gamma ^\prime ) \nonumber \\&= \sum _{\gamma ^\prime \in S} \psi ^{\text {PARNI}(*)-\bullet }_{\theta , k}(\gamma , \gamma ^\prime ) \alpha ^{\text {PARNI}}_{\theta , k}(\gamma , \gamma ^\prime ) \nonumber \\&\quad + \mathbb {I}\{\gamma \in S\} \sum _{\gamma ^\prime \in \Gamma } \psi ^{\text {PARNI}(*)-\bullet }_{\theta , k}(\gamma , \gamma ^\prime )(1- \alpha ^{\text {PARNI}}_{\theta , k}(\gamma , \gamma ^\prime )) \end{aligned}$$
(41)

and \(\alpha ^{\text {PARNI}}_{\theta , k}\) is the Metropolis-Hastings acceptance probability in (31). The Markov chain transition kernel acting on the product space \(\Gamma ^{\otimes L}\) is given by

$$\begin{aligned} P^{\text {PARNI}(*)-\bullet }_{(\theta , k^{\otimes L})}(\gamma ^{\otimes L}, S^{\otimes L}) = \prod _{l=1}^L P^{\text {PARNI}(*)-\bullet }_{(\theta , k^{l})}(\gamma ^{l}, S^{l}). \end{aligned}$$
(42)

To establish ergodicity and a strong law of large numbers (SLLN) for the PARNI sampler and its multiple chain acceleration, we require the following assumptions:

  1. (A.1)

The weighting function \(g:\mathbb {R}^+ \rightarrow \mathbb {R}^+\) is \(C_g\)-Lipschitz. That is, there exists a constant \(C_g\) such that for any \(t_2> t_1 > 0\) the weighting function g satisfies

    $$\begin{aligned} |g(t_2) - g(t_1)| \le C_g |t_2 - t_1|. \end{aligned}$$
    (43)

The thresholding function of LIT clearly satisfies this assumption. This is also a common condition satisfied by standard choices of balancing function. For example, Hastings’ choice \(g_\text {H}(t) = \min \{1,t\}\) satisfies (43) with \(C_g = 1\), and Barker’s choice \(g_\text {B}(t) = t/(1+t)\) also satisfies (43) with \(C_g = 1\) (its maximum derivative). A numerical sanity check of these constants is sketched after this list of assumptions.

  2. (A.1.a)

Let c be a small positive real number that is a universal lower bound on the ratio

    $$\begin{aligned} \frac{\pi (\gamma ^\prime )p_\eta ^{\text {RN}}(k|\gamma ^\prime )}{\pi (\gamma )p_\eta ^{\text {RN}}(k|\gamma )} \end{aligned}$$
    (44)

for all \(\gamma , \gamma ^\prime \in \Gamma \), \(k \in \mathcal {K}\) and \(\eta \in \Delta ^p_\epsilon = (\epsilon ,1-\epsilon )^p\). Then the weighting function \(g:(c,\infty ) \rightarrow (c,\infty )\) is \(C_g\)-Lipschitz; that is, there exists a constant \(C_g\) such that for any \(t_2> t_1> c > 0\) the weighting function g satisfies

    $$\begin{aligned} |g(t_2) - g(t_1)| \le C_g |t_2 - t_1|. \end{aligned}$$
    (45)

The square root function \(g_{\text {sq}}(t) = \sqrt{t}\) satisfies this condition with \(C_g = c^{-1/2}/2\).

  3. (A.2)

    The posterior distribution \(\pi \) is everywhere positive and bounded, that is, there exists a positive \(\Pi \in (1, \infty )\) such that

    $$\begin{aligned} \frac{1}{\Pi } \le \frac{\pi (\gamma ^\prime )}{\pi (\gamma )} \le \Pi \end{aligned}$$

    for all \(\gamma \), \(\gamma ^\prime \in \Gamma \).

  4. (A.3)

Recall the set \(\Delta _\epsilon ^{2p+1} = (\epsilon , 1-\epsilon )^{2p+1}\). The tuning parameters \(\theta ^{(i)} = (\eta ^{(i)}, \omega ^{(i)})\) are bounded away from 0 and 1, lying in this set,

    $$\begin{aligned} \theta ^{(i)} \in \Delta _\epsilon ^{2p+1} \end{aligned}$$
    (46)

    for some small \(\epsilon \in (0, 1/2)\).
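As a quick numerical illustration of (A.1) and (A.1.a) (referenced in the discussion of (A.1) above), one can check the Lipschitz bounds on randomly drawn pairs. This is a sanity check only, not a proof, and the value used for the lower bound c of (44) is an arbitrary assumption.

```python
import math
import random

def is_lipschitz(g, C, n_pairs=10_000, lower=0.0, width=10.0):
    """Check |g(t2) - g(t1)| <= C |t2 - t1| on random pairs in (lower, lower + width)."""
    for _ in range(n_pairs):
        t1 = lower + random.random() * width
        t2 = lower + random.random() * width
        if abs(g(t2) - g(t1)) > C * abs(t2 - t1) + 1e-12:
            return False
    return True

print(is_lipschitz(lambda t: min(1.0, t), C=1.0))              # Hastings' choice, C_g = 1
print(is_lipschitz(lambda t: t / (1.0 + t), C=1.0))            # Barker's choice, C_g = 1
c = 0.1                                                        # assumed lower bound from (44)
print(is_lipschitz(math.sqrt, C=0.5 / math.sqrt(c), lower=c))  # square root on (c, infinity)
```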

The analysis of convergence and ergodicity relies on the distribution of the Markov chain at time i and the associated total variation distance \(\Vert \cdot \Vert _{TV}\) from an arbitrary starting point. Given \(\{\gamma ^{l,(i)}\}_{i=0}^\infty \), these are defined as

$$\begin{aligned}&\mathcal {L}^{l,(i)} [(\gamma ^{l}, \theta ), S] := \Pr \left[ \gamma ^{l, (i)} \in S| \gamma ^{l, (0)} = \gamma ^{l}, \theta ^{0} = \theta \right] , \end{aligned}$$
(47)
$$\begin{aligned}&T^l(\gamma ^l, \theta , i) := \Vert \mathcal {L}^{l,(i)} [(\gamma ^{l}, \theta ), \cdot ] - \pi (\cdot ) \Vert _{TV}. \end{aligned}$$
(48)

We show here that the PARNI sampler is ergodic and satisfies a strong law of large numbers (SLLN). In mathematical terms, for any starting point \(\gamma ^{\otimes L} \in \Gamma ^{\otimes L}\) and \(\theta \in \Delta _\epsilon ^{2p+1}\), ergodicity means that

$$\begin{aligned} \lim _{i \rightarrow \infty } T^l(\gamma ^l, \theta , i) = 0 \end{aligned}$$
(49)

for any \(l = 1, \ldots , L\), while a strong law of large numbers (SLLN) implies that

$$\begin{aligned} \frac{1}{NL} \sum _{i=0}^{N-1} \sum _{l=1}^L f(\gamma ^{l,(i)}) \rightarrow \pi (f) \end{aligned}$$
(50)

almost surely, for any \(f:\Gamma \rightarrow \mathbb {R}\). We first establish two technical results before presenting the main theorem of this section.
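Before turning to those results, note that in practice the SLLN (50) is what justifies estimating posterior quantities by pooling ergodic averages over the L chains. A minimal sketch, assuming `chains` is a list of L equally long trajectories of model vectors and `f` is any real-valued function on \(\Gamma \):

```python
def pooled_ergodic_average(chains, f):
    """Pooled ergodic average (50): average f over all iterations of all L chains."""
    total = sum(f(gamma) for chain in chains for gamma in chain)
    n_samples = sum(len(chain) for chain in chains)
    return total / n_samples

# Example: the marginal posterior inclusion probability of covariate j can be
# estimated by averaging the indicator gamma[j] over the pooled samples:
# pip_j = pooled_ergodic_average(chains, lambda gamma: gamma[j])
```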

Lemma 1

(Simultaneous Uniform Ergodicity) The MCMC transition kernel \(P_\theta ^{\text {PARNI}(*)-\bullet }\) in (40) with target distribution \(\pi \) in (37) is simultaneously uniformly ergodic for any choice of \(\epsilon \in (0,1/2)\) in (46). That is, for any \(\delta >0\) there exists \(N = N(\delta , \epsilon )\) such that

$$\begin{aligned} \left\| \left( P^{\text {PARNI}(*)-\bullet }_\theta (\gamma ^{\otimes L},\cdot )\right) ^N - \pi ^{\otimes L}(\cdot )\right\| _{TV} \le \delta \end{aligned}$$

holds for any starting point \(\gamma ^{\otimes L} \in \Gamma ^{\otimes L}\) and any value \(\theta \in \Delta _\epsilon ^{2p+1}\).

Lemma 2

(Diminishing adaptation) Let the adaptation rate constant \(\lambda \) lie in (1/2, 1) for \(\bullet = \text {RM}\) and equal 1/2 for \(\bullet = \text {KW}\). Then, for any \(\epsilon \in (0, 1/2)\) and \(\pi _0 \in (0, 1)\), the PARNI sampler satisfies diminishing adaptation; that is, its transition kernel satisfies

$$\begin{aligned} \sup _{\gamma \in \Gamma } \left\| P^{\text {PARNI}(*)-\bullet }_{\theta ^{(i+1)}}(\gamma , \cdot ) - P^{\text {PARNI}(*)-\bullet }_{\theta ^{(i)}}(\gamma , \cdot ) \right\| _{TV} \le C i^{-\lambda } \end{aligned}$$
(51)

for some constant \(C < \infty \).

Theorem 2

(Ergodicity and SLLN) Consider a target distribution \(\pi (\gamma )\) as in (37), an adaptation rate constant \(\lambda \in (1/2,1)\) for \(\bullet = \text {RM}\) or \(\lambda = 1/2\) for \(\bullet = \text {KW}\) and \(\epsilon \in (0,1/2)\), which lead to an adaptation rate of \(\mathcal {O}(i^{-\lambda })\), and the parameter \(\pi _0 > 0\) in Algorithm 3. Then ergodicity (49) and a strong law of large numbers (50) hold for the \(\text {PARNI(T)-KW}\), \(\text {PARNI(T)-RM}\), \(\text {PARNI(B)-KW}\) and \(\text {PARNI(B)-RM}\) samplers as described in Algorithm 3 and their corresponding multiple chain acceleration versions.

Fig. 2 Simulated data: trace plots of log posterior model probability from the Add-Delete-Swap (ADS), Adaptively Scaled Individual (ASI) adaptation, Pointwise implementation of Adaptive Random Neighbourhood Informed and Thresholded proposal with Kiefer–Wolfowitz update (PARNIT-KW), Pointwise implementation of Adaptive Random Neighbourhood Informed and Thresholded proposal with Robbins-Monro update (PARNIT-RM), Pointwise implementation of Adaptive Random Neighbourhood Informed and balanced proposal with Kiefer–Wolfowitz update (PARNIB-KW) and Pointwise implementation of Adaptive Random Neighbourhood Informed and balanced proposal with Robbins-Monro update (PARNIB-RM) samplers for the first 1500 iterations on simulated datasets with a signal-to-noise ratio of 2

Table 1 Simulated data: relative average mean squared errors for the Adaptively Scaled Individual (ASI), Pointwise implementation of Adaptive Random Neighbourhood Informed and Thresholded proposal with Kiefer–Wolfowitz update (PARNIT-KW), Pointwise implementation of Adaptive Random Neighbourhood Informed and Thresholded proposal with Robbins-Monro update (PARNIT-RM), Pointwise implementation of Adaptive Random Neighbourhood Informed and balanced proposal with Kiefer–Wolfowitz update (PARNIB-KW) and Pointwise implementation of Adaptive Random Neighbourhood Informed and balanced proposal with Robbins-Monro update (PARNIB-RM) schemes on estimating posterior inclusion probabilities over important and unimportant variables respectively, against a standard Add-Delete-Swap algorithm

5 Numerical studies

5.1 Simulated data

We consider the data generation model introduced by Yang et al. (2016) and replicated in the simulation studies of Griffin et al. (2021) and Zanella and Roberts (2019). For a linear model with n observations and p covariates, data are generated from the model specification

$$\begin{aligned} y = X^* \beta ^* + \epsilon \end{aligned}$$

where \(\epsilon \sim N_n(0, \sigma ^2 I_n)\) for a pre-specified residual variance \(\sigma ^2\) and \(\beta ^* = \text {SNR}\times \tilde{\beta }\sqrt{(\sigma ^2\log p)/n}\), in which \(\text {SNR}\) denotes the signal-to-noise ratio. We set \(\tilde{\beta } = (2, -3, 2,2,-3,3,-2,3,-2,3,0,\ldots ,0)\) and draw each row \(X^*_i\) of the design matrix from a multivariate normal distribution with mean zero and covariance \(\Sigma \) with entries \(\Sigma _{j j} = 1\) for all j and \(\Sigma _{ij} = 0.6^{|i-j|}\) for \(i \ne j\). We consider four choices of SNR, namely 0.5, 1, 2 and 3, two choices of n, namely 500 and 1,000, and three choices of p, namely 500, 5000 and 50,000.
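This data-generating mechanism is straightforward to reproduce. A sketch using numpy is given below; the function name and the use of a dense Cholesky factor are our own choices (for very large p one would instead exploit the autoregressive structure of \(\Sigma \)).

```python
import numpy as np

def simulate_yang_data(n, p, snr, sigma2=1.0, rho=0.6, seed=0):
    """Simulate data from the Yang et al. (2016) design used in Sect. 5.1.

    The first ten entries of beta-tilde are (2,-3,2,2,-3,3,-2,3,-2,3) and the
    remaining p - 10 are zero; rows of X are N(0, Sigma) with Sigma_ij = rho^|i-j|.
    """
    rng = np.random.default_rng(seed)
    beta_tilde = np.zeros(p)
    beta_tilde[:10] = [2, -3, 2, 2, -3, 3, -2, 3, -2, 3]
    beta_star = snr * beta_tilde * np.sqrt(sigma2 * np.log(p) / n)
    # Covariance with entries rho^|i-j|; its Cholesky factor is used to draw X.
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.standard_normal((n, p)) @ np.linalg.cholesky(Sigma).T
    y = X @ beta_star + rng.normal(scale=np.sqrt(sigma2), size=n)
    return y, X

# e.g. y, X = simulate_yang_data(n=500, p=500, snr=2)
```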

We use the same prior parameter values \(V_\gamma = I_{p_\gamma }\), \(g = 9\) and \(h = 10/p\) as specified in Griffin et al. (2021). In the same work, a detailed description of the resulting posterior distributions is given. In the presence of a low SNR (\(\text {SNR} = 0.5\)), there is too much noise to detect the true non-zero variables and the resulting posterior is rather flat, with no variables having posterior inclusion probabilities larger than 0.1. The posterior distributions are completely different when the SNR is large (\(\text {SNR} = 2\) and \(\text {SNR} = 3\)). In these cases all of the true non-zero variables have inclusion probabilities close to 1 as the posterior distributions are more concentrated. In the intermediate case \(\text {SNR} = 1\) slightly less than half of the true non-zero variables have inclusion probabilities above 0.8. In general the problem of finding the true non-zero variables becomes more difficult in the cases with lower SNR, smaller n and larger p.

We are interested in comparing the performance of the ASI and PARNI schemes relative to an ADS sampler because the ASI scheme has already been compared with several other state-of-the-art MCMC algorithms in Griffin et al. (2021). The adaptive algorithms are run with 25 multiple chains. The first third of each chain is discarded as burn-in. In addition, to reduce the computational cost, all adaptation terminates after the burn-in period.

Trace plots of the chains are a straightforward way to visualise convergence. Figure 2 shows trace plots of posterior model probabilities from the ADS, ASI, PARNIT-KW, PARNIT-RM, PARNIB-KW and PARNIB-RM algorithms for the first 1500 iterations when SNR = 2. The ADS scheme fails to converge for all choices of n and p, and in particular becomes trapped in areas around the null model (i.e. the empty model) for a long period of time when \(p=50,000\). The ASI scheme converges reasonably quickly when p is 500 or 5000, but takes longer to reach high probability regions when \(p = 50,000\). This suggests that ASI mixes and converges more slowly on high-dimensional data-sets. By contrast, all the PARNI samplers mix rapidly in this setting, taking only a few moves to reach high probability regions.

The trace plots are not a fully fair comparison as they do not take running time into account. To better address computational efficiency we ran all of the algorithms for 3 repetitions. Each individual chain was run for 15 min and we stored the estimates of posterior inclusion probabilities. We calculated mean squared errors of these estimates relative to “gold standard” estimates obtained from a weighted tempered Gibbs sampler that was run for roughly 12 h. We show results in the form of performance relative to the ADS scheme in Table 1. Smaller values always indicate better performance of the scheme. A value of \(-1\) indicates that the scheme yields mean squared errors 10 times smaller than those from the ADS scheme on that specific data-set. Generally speaking, the mean squared errors for important variables are greater than those for unimportant variables for almost every data-set and scheme. The choice of n does not significantly affect the performance of the samplers. Concentrating on the results for important variables, the ASI scheme leads to an order of magnitude improvement in efficiency over the ADS sampler, which matches the results in Griffin et al. (2021). The four PARNI algorithms with different weighting functions and adaptations lead to similar levels of accuracy and dominate both the ASI and ADS schemes in every case except \(p=500\). In particular, the PARNI schemes result in roughly \(10^5\)-fold improvements over ADS and more than 10-fold improvements over ASI when \(p=50,000\) and SNR\( = 2\). On the other hand, the ADS scheme is quite adept at removing the unimportant variables when the true model size is small compared to the number of covariates. When \(p = 50,000\) and SNR\(>1\) the ASI scheme struggles with unimportant variables and leads to worse estimates than ADS, but the PARNI algorithms produce better estimates even for these unimportant variables. Overall, the results suggest that the PARNI samplers are more computationally efficient than the alternatives when p is large. More results from simulated data are provided in Section C.3 of the supplementary material.
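Under our reading of Table 1, the reported quantity is the base-10 logarithm of a scheme's average mean squared error (against the gold-standard inclusion probabilities) relative to that of ADS, so that a value of \(-1\) corresponds to a ten-fold reduction. A sketch of this metric, with the exact averaging over repetitions treated as an assumption:

```python
import numpy as np

def relative_avg_mse(pip_scheme, pip_ads, pip_gold):
    """Relative average MSE on the log10 scale, as we read Table 1.

    pip_scheme and pip_ads are arrays of shape (repetitions, variables) holding
    estimated posterior inclusion probabilities; pip_gold has shape (variables,)
    and is broadcast across repetitions. A value of -1 means the scheme's
    average MSE is 10 times smaller than that of Add-Delete-Swap.
    """
    mse_scheme = np.mean((pip_scheme - pip_gold) ** 2)
    mse_ads = np.mean((pip_ads - pip_gold) ** 2)
    return np.log10(mse_scheme / mse_ads)
```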

Table 2 Prior specifications for the 8 real data-sets
Fig. 3 Real data: plots of expected squared jumping distance and average mean squared error against average acceptance rate and \(\omega \) for the Pointwise implementation of Adaptive Random Neighbourhood Informed and Thresholded proposal with Robbins-Monro update (PARNIT-RM). a average acceptance rate against \(\omega \) for 4 small-p real datasets; b average acceptance rate against \(\omega \) for 4 large-p real datasets; c expected squared jumping distance against average acceptance rate for 4 small-p real datasets; d expected squared jumping distance against average acceptance rate for 4 large-p real datasets; e average mean squared error against average acceptance rate for 4 small-p real datasets; f average mean squared error against average acceptance rate for 4 large-p real datasets

Fig. 4 Real data: plots of expected squared jumping distance and average mean squared error against average acceptance rate and \(\omega \) for the Pointwise implementation of Adaptive Random Neighbourhood Informed and Balanced proposal with Robbins-Monro update (PARNIB-RM). a average acceptance rate against \(\omega \) for 4 small-p real datasets; b average acceptance rate against \(\omega \) for 4 large-p real datasets; c expected squared jumping distance against average acceptance rate for 4 small-p real datasets; d expected squared jumping distance against average acceptance rate for 4 large-p real datasets; e average mean squared error against average acceptance rate for 4 small-p real datasets; f average mean squared error against average acceptance rate for 4 large-p real datasets

Fig. 5 Real data: trace plots of log posterior model probability from the Add-Delete-Swap (ADS), Adaptively Scaled Individual (ASI) adaptation, Pointwise implementation of Adaptive Random Neighbourhood Informed and Thresholded proposal with Kiefer–Wolfowitz update (PARNIT-KW), Pointwise implementation of Adaptive Random Neighbourhood Informed and Thresholded proposal with Robbins-Monro update (PARNIT-RM), Pointwise implementation of Adaptive Random Neighbourhood Informed and balanced proposal with Kiefer–Wolfowitz update (PARNIB-KW) and Pointwise implementation of Adaptive Random Neighbourhood Informed and balanced proposal with Robbins-Monro update (PARNIB-RM) samplers for the first 1500 iterations on 4 moderate-p datasets

Fig. 6 Real data: trace plots of log posterior model probability from the Add-Delete-Swap (ADS), Adaptively Scaled Individual (ASI) adaptation, Pointwise implementation of Adaptive Random Neighbourhood Informed and Thresholded proposal with Kiefer–Wolfowitz update (PARNIT-KW), Pointwise implementation of Adaptive Random Neighbourhood Informed and Thresholded proposal with Robbins-Monro update (PARNIT-RM), Pointwise implementation of Adaptive Random Neighbourhood Informed and balanced proposal with Kiefer–Wolfowitz update (PARNIB-KW) and Pointwise implementation of Adaptive Random Neighbourhood Informed and balanced proposal with Robbins-Monro update (PARNIB-RM) samplers for the first 1500 iterations on 4 large-p real datasets

5.2 Real data

We consider eight real data-sets used in Griffin et al. (2021): four with moderate p and four with larger p.

The first data-set is the Tecator data-set, previously analysed by Brown and Griffin (2010) in the context of Bayesian linear regression and used by Lamnisos et al. (2013) and Griffin et al. (2021) in the context of Bayesian variable selection. It contains 172 observations and 100 explanatory variables. We also consider three small-p data sets constructed by Schäfer and Chopin (2013) to illustrate the performance of sequential Monte Carlo algorithms on Bayesian variable selection problems: the Boston Housing data \((n = 506, p=104)\), the Concrete data \((n = 1030, p = 79)\) and the Protein data \((n = 96, p = 88)\). These data sets are augmented with squared and interaction terms, which leads to strong dependence and multicollinearity among the covariates.

The last four data sets are high-dimensional problems with very large p. Three of them come from an experiment conducted by Lan et al. (2006) to examine the genetics of two inbred mouse populations. The experiment resulted in a data-set of 60 observations in total, used to monitor the expression levels of 22,575 genes in 31 female and 29 male mice. Bondell and Reich (2012) first considered this data-set in the context of variable selection. Three physiological phenotypes are also measured by quantitative real-time polymerase chain reaction (PCR); they are used as possible responses and are named \(\text {PCR}i\) for \(i=1,2,3\) respectively. For more details, see Lan et al. (2006) and Bondell and Reich (2012). The last data-set concerns genome-wide mapping of a complex trait. The data are described in Carbonetto et al. (2017) and consist of body and testis weight measurements recorded for 993 outbred mice, together with genotypes at 79,748 single nucleotide polymorphisms (SNPs) for the same mice. The main purpose of the study is to identify genetic variants contributing to variation in testis weight. We therefore take the testis weight as the response, include body weight as a regressor that is always in the model, and perform variable selection on the 79,748 SNPs.

Before analysing the performance of MCMC algorithms on the above data-sets, it is worth discussing the selection of an optimal acceptance rate for the PARNI-RM sampler. The optimal scaling of a Gaussian random walk proposal on some specific forms of target distribution is a well-studied problem. The most commonly used guideline is to seek an average acceptance rate of 0.234 (Gelman et al. 1997). The optimal acceptance rates for sophisticated informed proposals involving gradient information are typically larger, e.g. 0.57 for the Metropolis-adjusted Langevin algorithm (Grenander and Miller 1994; Roberts and Rosenthal 1998) and 0.65 for Hamiltonian Monte Carlo (Duane et al. 1987; Beskos et al. 2013). As our balanced random neighbourhood proposals can be viewed as a discrete analogue of these gradient-based algorithms, it is natural to expect that the PARNI samplers will have a larger optimal acceptance rate than a random walk Metropolis. To test this, we ran the PARNIT-RM and PARNIB-RM schemes targeting different acceptance rates on the above data-sets. Figures 3 and 4 show the effect of the average acceptance rate on the expected squared jumping distance and the average mean squared errors of these two schemes respectively. Both figures lead to the same conclusions. Parts (a) and (b) illustrate the relation between the thinning parameter \(\omega \) and the average acceptance rate: larger values of \(\omega \) correspond to larger jumps and therefore lead to a smaller average acceptance rate. Parts (c) and (d) suggest that the maximum average squared jumping distance occurs when the acceptance rate is around 0.65 for all data-sets. Parts (e) and (f) show that the average mean squared error is minimised when the average acceptance rate is around a similar value. Therefore, for the problems we have examined, targeting an average acceptance rate of 0.65 is a sensible default. Similar results for the simulated data-sets of Sect. 5.1 are presented in Section C.1 of the supplementary material. We stress that the PARNIT-KW and PARNIB-KW schemes do not require a target acceptance rate to be chosen, so users who are uncomfortable with choosing this quantity for a particular data-set are recommended to use those versions of the sampler.
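The average squared jumping distance used in Figs. 3 and 4, and as the Kiefer–Wolfowitz objective, can be estimated from the proposed models and their acceptance probabilities. A sketch under our assumptions about how the estimate is formed (acceptance-probability-weighted jumps, with the squared Euclidean distance between 0/1 model vectors, which equals their Hamming distance):

```python
import numpy as np

def average_squared_jumping_distance(currents, proposals, accept_probs):
    """Rao-Blackwellised estimate of the expected squared jumping distance.

    Each proposed jump is weighted by its acceptance probability; for 0/1 model
    vectors the squared Euclidean distance between the current state and the
    proposal equals their Hamming distance.
    """
    jumps = [
        a * float(np.sum(np.asarray(g) != np.asarray(g_new)))
        for g, g_new, a in zip(currents, proposals, accept_probs)
    ]
    return float(np.mean(jumps))
```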

We consider a total of ten different MCMC schemes for these data-sets. In addition to the six schemes used in the simulation study (ADS, ASI, PARNIT-KW, PARNIT-RM, PARNIB-KW and PARNIB-RM), we also implement four state-of-the-art algorithms: the Hamming ball sampler (HBS) with radius 1 of Titsias and Yau (2017), the tempered Gibbs sampler (TGS) and weighted tempered Gibbs sampler (WTGS) of Zanella and Roberts (2019), and the Locally Informed and Thresholded (LIT) scheme of Zhou et al. (2021) (which uses the same weighting function as the LIT-MH-1 scheme in their paper). All algorithms are run for the same amount of time and compared using average mean squared errors. Only the adaptive schemes are run with 25 multiple shorter chains; the other schemes use a single longer chain. The prior specification for each data-set is given in Table 2.

Figures 5 and 6 show trace plots of posterior model probabilities from the ADS, ASI, PARNIT-KW, PARNIT-RM, PARNIB-KW and PARNIB-RM algorithms for the first 1500 iterations on all eight real data-sets. Overall, the PARNI algorithms perform better than the ADS and ASI schemes in both convergence and mixing. It is clear that the ADS scheme does not mix well since it struggles to explore the model space. All algorithms reach high-probability regions for the data-sets with moderate p in roughly the same number of iterations; however, the PARNI schemes reach these high-probability regions faster and accept more jumps within the model space. On the large-p data-sets, the algorithms behave differently. The ADS scheme gets trapped at the null model and only proposes models around it, and the ASI algorithm does not converge properly within the first 1500 iterations either. The PARNI schemes, by contrast, accept almost every proposed state and mix very quickly. They are able to propose and accept models with relatively low posterior probabilities and explore the sample space efficiently.

We next turn attention to the average mean squared errors on these eight real data-sets. The results are shown in Table 3. On the moderate-p data-sets, the PARNI samplers do not dominate the other schemes, but they still lead to good results. However, PARNI performs worst on the Boston Housing and Concrete data-sets, which are multi-modal and contain intricately correlated covariates. This suggests that the point-wise sub-proposals of PARNI can become trapped in isolated local modes. The ADS scheme performs well in terms of computational efficiency for the Tecator and Concrete data-sets due to a convenient computational implementation which has the cheapest computational cost among the competing schemes. Due to its dimension-free mixing property, the LIT scheme outperforms ADS except on the Tecator data-set, where all covariates carry non-negligible weights and all covariates are therefore in the potentially influential subset S of \(\Gamma \) (see Sect. 2.3 of Zhou et al. (2021) for more detail). For large-p problems all the PARNI schemes significantly outperform the other samplers. Surprisingly, the HBS and TGS schemes lead to worse estimates than ADS. This can be explained by the computational cost per iteration of the HBS, TGS and WTGS algorithms, which is linear in p. The combination of these large computational costs and the issue of rarely exploring important variables leads to low efficiency for HBS and TGS. The WTGS algorithm still outperforms TGS, which coincides with the conclusions in Zanella and Roberts (2019), where the WTGS algorithm is shown to have a smaller relaxation time than TGS. The ASI algorithm gives estimates competitive with WTGS in high dimensions but is eventually dominated by the PARNI schemes. The LIT scheme leads to better results on the SNP data-set (\(n = 993\)) but not on the PCR data-sets (\(n=60\)), since the dimension-free mixing of LIT only holds when n is comparatively large. It also yields larger average mean squared errors than the PARNI samplers on all large-p data-sets because the ADS-type neighbourhoods of the LIT scheme only contain models within a Hamming distance of at most 2, so the jumping distance of LIT is bounded by 2, whereas PARNI can potentially propose larger jumps. Among the PARNI schemes, the two weighting schemes (thresholding and balancing function) have a similar level of efficiency. Specifically, the thresholding function estimates the posterior inclusion probabilities with lower relative average mean squared errors than the balancing function on all four moderate-p data-sets and the PCR1 data-set. The performance of all PARNI schemes is similar on the SNP data-set and outperforms that of the competitors. In terms of the adaptation schemes, the PARNI sampler with Kiefer–Wolfowitz adaptation generally performs better than the Robbins-Monro version, but only by a small margin. This is because the optimal acceptance rate is problem-specific and not exactly 0.65 for every data-set.

Table 3 Real data: relative average mean squared errors for the Adaptively Scaled Individual (ASI), Hamming Ball Sampler (HBS), Tempered Gibbs Sampler (TGS), Weighted Tempered Gibbs Sampler (WTGS), Locally Informed and Thresholded (LIT), Pointwise implementation of Adaptive Random Neighbourhood Informed and Thresholded proposal with Kiefer–Wolfowitz update (PARNIT-KW), Pointwise implementation of Adaptive Random Neighbourhood Informed and Thresholded proposal with Robbins-Monro update (PARNIT-RM), Pointwise implementation of Adaptive Random Neighbourhood Informed and balanced proposal with Kiefer–Wolfowitz update (PARNIB-KW) and Pointwise implementation of Adaptive Random Neighbourhood Informed and balanced proposal with Robbins-Monro update (PARNIB-RM) schemes on estimating posterior inclusion probabilities over important variables against a standard Add-Delete-Swap algorithm

6 Discussion and future work

In this paper we present a framework for neighbourhood-based MCMC algorithms and propose a new scheme as an informed counterpart to the ASI algorithm in Griffin et al. (2021), using elements from the locally informed Metropolis-Hastings proposals introduced in Zanella (2020) and Zhou et al. (2021). To address the expensive computational costs introduced by the informed proposal, we introduce two less computationally costly algorithms, the PARNI schemes, which can lead to a dramatic improvement in computational efficiency. In addition, we offer two choices of informed weighting function, the thresholding function and the balancing function. The PARNI schemes also allow two different adaptation schemes, the Kiefer–Wolfowitz and Robbins-Monro schemes. The numerical results in Sect. 5 support the power of the algorithmic structure of PARNI. The success of these new schemes is attributed to two aspects: firstly, the adaptation helps to explore the areas of interest (mainly those with high posterior probabilities), and secondly, the locally informed proposals are able to suppress random walk behaviour in high dimensions and lead to rapidly mixing samplers in practice. Based on the numerical studies on both simulated and real data-sets, we recommend using a PARNI sampler with the Kiefer–Wolfowitz scheme for tackling high-dimensional (large-p) Bayesian variable selection problems. We note that it can still be challenging for the PARNI samplers to move across low-probability regions, which could affect performance when the posterior has very isolated modes. This is because the PARNI samplers propose models sequentially and each sub-proposal can alter at most one position. On the other hand, the original ARNI scheme can take larger jumps and is better able to explore well-separated modes, albeit with a substantial increase in computational cost. In summary, new schemes like PARNI show the potential of combining adaptive, random neighbourhood and informed proposals. We look forward to adding more theoretical support to the numerical evidence shown here in future work. The code to run the PARNI samplers and the aforementioned numerical studies can be downloaded from https://github.com/XitongLiang/The-PARNI-scheme.git.

There are many directions for extensions and future work. Some recent work has shed light on the extra computational costs that come with informed proposals. Grathwohl et al. (2021) develop an accelerated locally informed proposal that uses gradients of the log target mass function. It is possible to derive the gradient of the posterior mass function with respect to \(\gamma \) with minor modifications to the representation of the posterior distribution \(\pi (\gamma )\). To address the lack of mode jumping in the PARNI schemes, we can first try to construct larger blocks intelligently so that separated models are covered in a single block. This can be achieved by introducing basis vectors beyond the Cartesian case in the block construction. One could also use the sequential Monte Carlo methods of Schäfer and Chopin (2013) and Ma (2015), which are better able to handle multimodality. Combining them with PARNI offers the chance of producing efficient methods for highly multimodal posterior distributions with well-separated modes. Another option in this direction is the JAMS algorithm of Pompe et al. (2020), which first locates each individual mode and then produces a mixture proposal that involves jumps within and between modes.

We also intend to study the performance of the PARNI schemes in generalised linear models, as in Wan and Griffin (2021), or in more flexible Bayesian variable selection models such as that suggested by Rossell and Rubio (2018). In these cases, the regression coefficients and residual variance can no longer be integrated out analytically and the likelihood of \(\gamma \) is not available in closed form. Informed proposals for such models are computationally challenging because the proposals involve evaluations of this likelihood, and the required approximations and estimates of the marginal likelihood are computationally intensive. One possible approach is the data-augmentation method using the Pólya-gamma distribution described in Polson et al. (2013). The design does, however, require some care to avoid the inefficiency caused by introducing a large number of auxiliary variables in large-n problems. We also believe that random neighbourhood samplers can be used beyond variable selection, and we aim to consider applications to other discrete-valued sampling problems in future work.