## 1 Introduction

Sampling a random variable from a given target distribution is a key problem in Bayesian inference. The Langevin Monte Carlo (LMC) algorithm has attracted attention for its high efficiency and scalability on large datasets. Although sampling methods in this category are usually designed to handle unbounded random variables, in practical problems the target variable is often restricted to a bounded space. In such cases, the common practice is to match the domain of the variable with a transformation. For example, when the target variable must be non-negative, the exponential function is frequently adopted as the transformation.

This paper discusses the problem of drawing samples from a constrained target distribution via a transform of unconstrained samples generated by the LMC algorithm. More precisely, let $$\theta \sim \pi _\theta (\theta )$$ be the target random variable in a constrained state space $${\mathbb{R}}_c$$, e.g., a (semi-)finite interval of $${\mathbb{R}}$$, and let $$\varphi \sim \pi (\varphi )$$ be a proxy random variable defined on the whole real line $${\mathbb{R}}$$. Although we are interested in sampling from $$\pi _\theta (\theta )$$, LMC is unsuitable for directly handling such variables because its diffusion is prone to overstep the boundary. Thus, we consider the following two-step LMC algorithm:

\begin{aligned} \varphi _{t+1}= & {} \varphi _t + \epsilon \widehat{\nabla }_\varphi \log \pi (\varphi _t) + \sqrt{2\epsilon } \eta _t, \end{aligned}
(1)
\begin{aligned} \theta _{t+1}= & {} f(\varphi _{t+1}), \end{aligned}
(2)

where f is a transform function that maps the proxy to the target domain and $$\widehat{\nabla }$$ denotes an unbiased stochastic gradient operator. The stochastic gradient corresponds to approximating the likelihood with subsamples in Bayesian sampling and improves computational efficiency on large datasets.
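As a concrete sketch of the two-step update in Eqs. (1) and (2) — with function names and the illustrative proxy/transform pair chosen by us, not prescribed by the paper:

```python
import numpy as np

def two_step_lmc(grad_log_pi, f, phi0, eps, n_steps, seed=0):
    """Two-step LMC of Eqs. (1)-(2): move the unconstrained proxy phi with
    a Langevin step on log pi(phi), then map it into the constrained
    domain with the transform f."""
    rng = np.random.default_rng(seed)
    phi = phi0
    thetas = np.empty(n_steps)
    for t in range(n_steps):
        eta = rng.standard_normal()                                   # eta_t ~ N(0, 1)
        phi = phi + eps * grad_log_pi(phi) + np.sqrt(2 * eps) * eta   # Eq. (1)
        thetas[t] = f(phi)                                            # Eq. (2)
    return thetas

# Illustration: a standard-normal proxy folded onto R_+ by theta = |phi|.
samples = two_step_lmc(lambda p: -p, abs, phi0=0.0, eps=1e-2, n_steps=5000)
```

Every generated $$\theta _t$$ lies in the constrained domain by construction, whatever the proxy dynamics do.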

The following three possible approaches conforming to Eqs. (1) and (2) are discussed in this paper.

• Mirroring trick (Sect. 3): the heuristic employed in Patterson and Teh (2013): simply match the domain, e.g., $$\theta =|\varphi |$$ for non-negative $$\theta$$, assuming the gradients are unchanged, $$\widehat{\nabla }_\varphi \log \pi (\varphi ) = \widehat{\nabla }_\theta \log \pi _\theta (\theta )$$.

• Itô formula (Sect. 4): apply the transformation f directly to the stochastic differential equation (SDE) that underlies Eqs. (1) and (2), obtaining $$\widehat{\nabla }_\varphi \log \pi (\varphi )$$ via the Itô formula (Itô 1944), the chain rule of stochastic calculus.

• Change of random variable (CoRV) (Sect. 5): apply the transformation f to the random variable itself, obtaining $$\widehat{\nabla }_\varphi \log \pi (\varphi )$$ through the Jacobian relation $$\pi (\varphi ) = \pi _\theta (\theta )|f'(\varphi )|$$.

Table 1 compares these approaches. Although the mirroring trick has been studied previously and the Itô formula is one of the most fundamental tools in stochastic analysis, many implementations and standard sampling software packages employ a method based on the CoRV approach without theoretical investigation. In this paper, we analyze these methods and clarify their properties. The theoretical results show that the CoRV approach has a pitfall: some transform functions amplify the gradient noise and break the weak (distribution-law) convergence of the Euler-Maruyama discretization scheme. We show that Lipschitz continuity of the transform function suppresses the noise and recovers weak convergence. We also confirm that such a transformation makes the CoRV algorithm stable near a domain boundary. Numerical experiments on Bayesian non-negative matrix factorization and binary neural networks support the theory.

Technically, we also confirm the following propositions.

• The Itô formula approach almost surely diverges around a domain boundary regardless of the target distribution and transform function (Theorem 2).

• The CoRV approach with Lipschitz transform functions has a stationary distribution (Theorem 4) and converges weakly (Theorem 5) in the same order as SGLD on $${\mathbb{R}}$$.

Note that our discussion follows the common setting of stochastic gradient MCMC as in previous studies (Welling and Teh 2011; Sato and Nakagawa 2014; Teh et al. 2016). That is, we adopt stochastic gradients with minibatch and omit a Metropolis-Hastings rejection step to avoid performance overhead on a massive dataset. Thus, we guarantee the sampling accuracy by discretization analysis of SDE instead of conforming to the detailed balance of a Markov chain.

## 2 Review: stochastic gradient MCMC

This section outlines the original stochastic gradient MCMC algorithm designed for an unconstrained state space. Our notation uses a one-dimensional parameter for simplicity; the extension to multi-dimensional cases is straightforward.

### 2.1 Langevin Monte Carlo

The Langevin Monte Carlo (LMC) algorithm is an efficient sampler for unbounded variables. Let us consider a target distribution $$\pi _\theta (\theta )$$ and its potential $$U_\theta (\theta ) = - \log \pi _\theta (\theta )$$. We attach the subscript $$_\theta$$ to emphasize that the distribution and potential represent the constrained random variable $$\theta$$; the subscript is omitted when we discuss unconstrained random variables. The LMC algorithm is

\begin{aligned} \theta _{t+1} = \theta _t - \epsilon _t U'_\theta (\theta _t) + \sqrt{2 \epsilon _t} \eta _t, \ \ \ \eta _t \sim {\mathcal{N}}(0,1), \end{aligned}
(3)

where $$U'_\theta (\theta ) = \frac{d}{d\theta } U_\theta (\theta )$$, $${\mathcal{N}}(0,1)$$ is the standard Gaussian distribution, and $$\epsilon _t>0$$ is the step size.
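A minimal sketch of the update (3); the quadratic potential $$U(\theta )=\theta ^2/2$$ is our illustrative choice, and the function names are ours:

```python
import numpy as np

def lmc(grad_U, theta0, eps, n_steps, seed=0):
    """Plain LMC, Eq. (3): theta moves along -U' plus Gaussian noise of
    variance 2*eps, so only the unnormalized density exp(-U) is needed."""
    rng = np.random.default_rng(seed)
    theta = theta0
    out = np.empty(n_steps)
    for t in range(n_steps):
        theta = theta - eps * grad_U(theta) + np.sqrt(2 * eps) * rng.standard_normal()
        out[t] = theta
    return out

# U(theta) = theta^2/2 gives a standard normal equilibrium; the
# normalizing constant of exp(-U) never appears in the update.
samples = lmc(lambda th: th, theta0=0.0, eps=1e-2, n_steps=50000)
```

Note that only $$U'$$ enters the recursion, which is exactly why unnormalized densities pose no difficulty.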

As step size $$\epsilon _t \rightarrow 0$$, the variable $$\theta _t$$ becomes an Itô process described by an SDE

\begin{aligned} d\theta (t) = - U'_\theta (\theta )dt + \sqrt{2} dW(t), \end{aligned}
(4)

with the Wiener process W(t). In other words, the LMC algorithm (3) can be seen as a discretization of the SDE (4) called Langevin equation. From the Fokker–Planck equation, the Itô process $$\theta (t)$$ is known to have the stationary distribution (Gardiner 2009)

\begin{aligned} \pi _\theta (\theta ) \propto \exp (-U_\theta (\theta )). \end{aligned}
(5)

Note that the LMC algorithm is capable of generating samples from an unnormalized density function $$\exp (-U_\theta (\theta ))$$. This property is favorable for handling Bayesian posteriors $$p(\theta |x)=\frac{p(x|\theta )p(\theta )}{\int p(x|\theta )p(\theta ) d\theta }$$, where the normalizing evidence term is often intractable.

### 2.2 Stochastic gradient Langevin dynamics

In Bayesian learning, the LMC algorithm is not scalable with respect to the number of data points. Let us consider that the target distribution is a Bayesian posterior of $$\theta$$ given observation $$x = \{x_1, \dots , x_n \}$$ with a prior $$p(\theta )$$,

\begin{aligned} \pi _\theta (\theta ; x) \propto p(x|\theta ) p(\theta ). \end{aligned}
(6)

The potential $$U_\theta$$ consists of the log-prior and the log-likelihood over the full dataset. While LMC needs to calculate the derivative of the potential

\begin{aligned} U'_\theta (\theta ) = - \sum _{i=1}^n \frac{d}{d\theta } \log p(x_i|\theta ) - \frac{d}{d\theta } \log p(\theta ) \end{aligned}
(7)

at each iteration, this computation becomes prohibitive as the number of data points n grows large, which is critical in modern machine learning with massive datasets.

To keep the time complexity constant, one can apply stochastic approximation. Instead of the exact gradient, a stochastic gradient is computed via a mini-batch $$S_t$$, a subset of the full dataset,

\begin{aligned} \widehat{U}'_\theta (\theta ) = - \frac{n}{|S_t|}\sum _{x_i\in S_t} \frac{d}{d\theta } \log p(x_i|\theta ) - \frac{d}{d\theta } \log p(\theta ). \end{aligned}
(8)

It is a noisy approximation of $$U'_\theta (\theta )$$. Replacing the exact gradient of LMC with the stochastic gradient, the stochastic gradient Langevin dynamics (SGLD) algorithm generates samples as follows:

\begin{aligned} \theta _{t+1} = \theta _t - \epsilon _t \widehat{U}'_\theta (\theta _t) +\sqrt{2 \epsilon _t} \eta _t, \ \ \ \eta _t \sim {\mathcal{N}}(0,1). \end{aligned}
(9)

SGLD also enjoys a computational gain by omitting the Metropolis-Hastings rejection step, which ordinary MCMC methods usually run to ensure detailed balance.
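The pieces of Eqs. (8) and (9) can be put together for a toy conjugate model; the N(θ, 1) likelihood with a N(0, 1) prior and all names below are our illustrative choices:

```python
import numpy as np

def sgld_step(theta, x, batch_idx, eps, rng):
    """One SGLD step, Eqs. (8)-(9), for a N(theta, 1) likelihood with a
    N(0, 1) prior (a toy conjugate model chosen for illustration).  The
    minibatch gradient is rescaled by n/|S_t| as in Eq. (8)."""
    n, m = len(x), len(batch_idx)
    # \hat{U}' = -(n/m) * sum_i d/dtheta log p(x_i|theta) - d/dtheta log p(theta)
    grad_U_hat = -(n / m) * np.sum(x[batch_idx] - theta) + theta
    return theta - eps * grad_U_hat + np.sqrt(2 * eps) * rng.standard_normal()

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=1000)       # synthetic observations, true mean 2
theta = 0.0
for t in range(20000):
    batch = rng.choice(len(x), size=32, replace=False)
    theta = sgld_step(theta, x, batch, eps=1e-4, rng=rng)
# theta now fluctuates around the posterior mean n*xbar/(n+1), close to 2.
```

Each iteration touches only 32 of the 1000 data points, which is the scalability argument in the text.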

Due to the approximation of the gradient and the exclusion of the rejection step, SGLD does not necessarily satisfy the detailed balance of a Markov chain. Instead, weak convergence with regard to the SDE (4) has been discussed in the literature (Sato and Nakagawa 2014; Teh et al. 2016). Let the stochastic gradient satisfy the following assumption.

### Assumption 1

(Gradient error) The stochastic gradient $$\widehat{U}'_\theta (\theta )$$ is written using the exact gradient $$U'_\theta (\theta )$$ and an error term $$\delta$$ as

\begin{aligned} \widehat{U}'_\theta (\theta ) = U'_\theta (\theta ) + \delta , \end{aligned}
(10)

where $$\delta$$ is white noise or a Wiener process with zero mean and finite moments satisfying

\begin{aligned} {\mathbb{E}}\left[ \delta \right] = 0, \ \ \ \ \ {\mathbb{E}}\left[ |\delta |^l\right] < \infty , \end{aligned}
(11)

for some integer $$l \ge 2$$.

Then the weak convergence defined below holds for the sample sequence $$\{\theta _t\}_{t=1}^T$$ (Sato and Nakagawa 2014; Teh et al. 2016). In short, weak convergence states that the discretization error of SGLD vanishes in expectation at any fixed time as the time increment approaches zero.

### Definition 1

(Weak convergence (Iacus 2008)) Let $$Y_\epsilon$$ be a time-discretized approximation of a continuous-time process Y, and let $$\epsilon _0$$ be the maximum time increment of the discretization. $$Y_\epsilon$$ is said to converge weakly to Y if, for any fixed time T and any continuously differentiable function h of polynomial growth, it holds that

\begin{aligned} \lim _{\epsilon \rightarrow 0} \left| {\mathbb{E}}\left[ h(Y_\epsilon (T))\right] -{\mathbb{E}}\left[ h(Y(T))\right] \right| = 0. \end{aligned}
(12)

### 2.3 Other related work

Brosse et al. (2017) developed another line of research: an LMC algorithm for a random variable on a convex body. They employed proximal MCMC (Pereyra 2016; Durmus et al. 2018) with the Moreau-Yosida envelope, which yields a well-behaved regularization of the target density on a convex body that preserves convexity and Lipschitzness. The sampling distribution is, nevertheless, an unbounded approximation of the target distribution, and it still draws samples from outside the domain. The restriction to log-concave targets and the cost of computing the proximal operator for each sample prevent its application to large datasets as well as to complex models such as neural networks.

## 3 Mirroring trick

Although many studies have been carried out for LMC and SGLD defined on the real space $${\mathbb{R}}$$, theoretical analysis in the constrained space $${\mathbb{R}}_c$$ remains open. The difficulty comes from the fact that the LMC algorithm is a discretization of an Itô process whose equilibrium is a target distribution on $${\mathbb{R}}$$. This is problematic in multiple applications where we handle latent random variables in a bounded domain, such as latent Dirichlet allocation (Blei et al. 2003), where $$\theta$$ lies in a probability simplex; non-negative matrix factorization (Cemgil 2009), with all elements of $$\theta$$ being non-negative; and binary neural networks (Courbariaux et al. 2015; Hubara et al. 2016), with $$\theta \in (-1, 1)$$.

The mirroring trick is a straightforward heuristic to cope with this problem. It sends outgoing samples back at the domain boundaries so that they do not overstep the constraint. Patterson and Teh (2013) employed it to sample from a gamma distribution defined on $${\mathbb{R}}_+$$, merely taking the absolute value of the generated sample. There is no convergence guarantee for this trick, because it assumes that $$\widehat{\nabla } \log \pi (\varphi ) = \widehat{\nabla } \log \pi _\theta (\theta )$$ and that the transformation f does not change the equilibrium. The heuristic is partially justified by Bubeck et al. (2015) and Bubeck et al. (2018). They assumed accurate gradients and extended the LMC algorithm to an SDE with a reflecting boundary condition. Their stochastic process defined on a convex body, called reflected Brownian motion, is discretized into an LMC algorithm accompanied by the mirroring trick. This interpretation helps its theoretical investigation. However, Bubeck et al. (2018) also stated that the extension of their result to SGLD with stochastic gradients is an open problem for future work.

Our preliminary experiments show that the mirroring trick empirically suffers from inaccurate sampling near the boundaries. Figure 1 (see mirror) indicates that the mirroring trick fails to capture the distribution especially when the density is sparse, i.e., concentrated at the boundaries. The sampling can thus be inaccurate when the model uses a sparse prior, which is often employed to avoid overfitting. This disadvantage forces us to set a tiny step size for accurate sampling near the boundary, which results in substantial performance degradation in the experiments in Sect. 7.
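The trick itself is a one-line change to the Langevin update; the sketch below, with an exponential target on $${\mathbb{R}}_+$$ as our illustrative choice, makes the uncontrolled approximation explicit:

```python
import numpy as np

def mirrored_sgld(grad_U_theta, theta0, eps, n_steps, seed=0):
    """Mirroring trick: take the ordinary Langevin step with the target
    gradient left unchanged, then reflect any sample that oversteps the
    boundary of R_+ by taking its absolute value (Patterson and Teh 2013)."""
    rng = np.random.default_rng(seed)
    theta = theta0
    out = np.empty(n_steps)
    for t in range(n_steps):
        theta = theta - eps * grad_U_theta(theta) + np.sqrt(2 * eps) * rng.standard_normal()
        theta = abs(theta)            # send the outgoing sample back inside
        out[t] = theta
    return out

# Exponential target on R_+ (our illustrative choice): U(theta) = theta.
samples = mirrored_sgld(lambda th: 1.0, theta0=1.0, eps=1e-3, n_steps=50000)
```

The reflection keeps every sample non-negative, but nothing in the update corrects the gradient for the folded geometry, which is the source of the boundary bias discussed above.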

## 4 Itô formula

Here we consider the following two-step modification: first, we use the Itô formula to construct the SDE in the unconstrained domain with the corresponding transform function. Then, we discretize the SDE to obtain the desired algorithm. While this derivation is straightforward and theoretically well-grounded, we later show that the resulting algorithm inherits an instability near the boundary.

For stochastic differential equations, we have the following formula.

### Theorem 1

(Itô Formula (Itô 1944)) Assume X(t) is an Itô process satisfying the stochastic differential equation

\begin{aligned} dX(t)=a(t,X(t))dt+b(t,X(t))dW(t). \end{aligned}
(13)

Let h(t, X(t)) be a given bounded function in $$C^2((0,\infty )\times {\mathbb{R}})$$. Then, h(t, X(t)) satisfies the stochastic differential equation

\begin{aligned} d h(t,X(t))={\mathcal{L}}_1 h(t,X(t)) dt+ {\mathcal{L}}_2 h(t,X(t))dW(t), \end{aligned}
(14)

where $${\mathcal{L}}_1$$ and $${\mathcal{L}}_2$$ are linear operators defined by

\begin{aligned} {\mathcal{L}}_1&=\frac{\partial }{\partial t}+a\frac{\partial }{\partial X} +\frac{1}{2}b^2\frac{\partial ^2}{\partial X^2},\quad {\mathcal{L}}_2 =b \frac{\partial }{\partial X}. \end{aligned}
(15)

We begin by transforming the following Itô process of $$\theta (t)$$,

\begin{aligned} d\theta (t) = a(t,\theta )dt + b(t,\theta ) dW(t). \end{aligned}
(16)

Let $$g:{\mathbb{R}}_c \rightarrow {\mathbb{R}}$$ be a smooth invertible function from a bounded target variable $$\theta \in {\mathbb{R}}_c$$ to an unbounded proxy variable $$\varphi \in {\mathbb{R}}$$. The constrained state space $${\mathbb{R}}_c$$ can be a finite or semi-infinite interval of $${\mathbb{R}}$$ (see Definition 2 and Table 2 for details). We consider a new stochastic process $$\varphi (t)$$ defined by

\begin{aligned} \varphi (t) = g(\theta (t)). \end{aligned}
(17)

From the Itô formula (Theorem 1), $$\varphi (t)$$ is also an Itô process of

\begin{aligned} d\varphi (t) = \left\{ a(t,\theta ) g'(\theta (t)) + \frac{b^2}{2} g''(\theta (t)) \right\} dt + b(t,\theta ) g'(\theta (t)) dW(t). \end{aligned}
(18)

Letting $$a(\theta ) = - U'_\theta (\theta )$$ and $$b = \sqrt{2}$$, discretizing the process results in the following LMC

\begin{aligned} \varphi _{t+1} = \varphi _t + \epsilon \left( - g'(\theta _t) U'_\theta (\theta _t) + g''(\theta _t)\right) + \sqrt{2\epsilon } g'(\theta _t) \eta _t. \end{aligned}
(19)

While a general connection between SDE and LMC is discussed by Ma et al. (2015), this algorithm is distinct in that the transform step $$\theta = g^{-1}(\varphi )$$ is employed to keep samples in the target domain.

Unfortunately, Eq. (19) is likely to draw inaccurate samples whether or not the gradient has noise. Figure 1 demonstrates that this method (labeled as Ito) fails to track the target density. We attribute this phenomenon to the intrinsic instability around the boundary regardless of the target potential and the transform function.

To theoretically discuss this instability, we first assume the following class of transform functions.

### Assumption 2

(Transform function) Let f be a Lipschitz continuous and monotonically increasing function; namely, there exists a constant $$L>0$$ such that, for all $$\varphi \in {\mathbb{R}}$$,

\begin{aligned} 0 \le f'(\varphi ) \le L. \end{aligned}
(20)

The boundary value of the target domain, denoted by $$\partial S$$, corresponds to infinity in the proxy space: $$\lim _{\varphi \rightarrow \infty } f(\varphi ) = \partial S$$, and $$\lim _{\varphi \rightarrow \infty } f'(\varphi )$$ exists.

All functions except the exponential in Table 3 satisfy this assumption.
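A quick numerical contrast makes the distinction concrete; the sigmoid and exponential here are our illustrative representatives of the two cases:

```python
import numpy as np

phis = np.array([0.0, 5.0, 10.0, 20.0])
sig = 1.0 / (1.0 + np.exp(-phis))
sig_prime = sig * (1.0 - sig)   # f' for the sigmoid: bounded by 1/4, cf. Eq. (20)
exp_prime = np.exp(phis)        # f' for the exponential: unbounded

# sig_prime decays to 0 as phi grows (the behavior formalized in Lemma 4),
# while exp_prime blows up, so the exponential violates Assumption 2.
```

The bounded, vanishing derivative of the sigmoid is exactly what later suppresses the transformed gradient noise.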

Depending on the constraints of the target domain, f may instead be a decreasing or an upper- and lower-bounded function. We continue with Assumption 2 for simplicity, though our discussion also applies to the more general d-dimensional case:

### Definition 2

(Constrained state space) The constrained state space $${\mathbb{R}}_c^d$$ is a d-dimensional interval defined as a subset of $${\mathbb{R}}^d$$ that is the Cartesian product of d (semi-)finite intervals, $$I=I_{1}\times I_{2}\times \cdots \times I_{d}$$, one on each coordinate axis.

The instability of the algorithm is shown in the following theorem.

### Theorem 2

(Instability of the Itô transformation) Let $$f=g^{-1}: {\mathbb{R}}\rightarrow {\mathbb{R}}_c$$ satisfy Assumption 2. Then, for any $$\epsilon >0$$ and any $$U'_\theta (\theta )$$, as $$\theta \in S$$ approaches $$\partial S$$ from the inside, the single-step difference of the Itô transformation method diverges almost surely:

\begin{aligned} \lim _{\theta \rightarrow \partial S} |\varphi _{t+1} - \varphi _t| = \infty \end{aligned}
(21)

Please refer to Sect. 6 for all the proofs in this paper.

This suggests that the step size must be small enough to cope with the instability, but such a small step size makes the sampling substantially slower to mix.
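The blow-up of Theorem 2 can be observed directly on the update (19). The sketch below uses $$g = \log$$ (i.e., $$\theta = e^\varphi$$ on $${\mathbb{R}}_+$$) and an Exp(1) target, both our illustrative choices, and evaluates a single step for a fixed Gaussian draw as $$\theta$$ approaches the boundary:

```python
import numpy as np

def ito_step(phi, grad_U_theta, eps, eta):
    """One update of Eq. (19) with g = log, so theta = exp(phi) on R_+
    (an illustrative choice): g'(theta) = 1/theta, g''(theta) = -1/theta^2."""
    theta = np.exp(phi)
    gp, gpp = 1.0 / theta, -1.0 / theta**2
    return phi + eps * (-gp * grad_U_theta(theta) + gpp) + np.sqrt(2 * eps) * gp * eta

# Single-step displacement as theta -> 0: both the drift (via g'') and the
# diffusion (via g') blow up, for any fixed step size.
eps, eta = 1e-3, -0.5
steps = [abs(ito_step(p, lambda th: 1.0, eps, eta) - p)   # Exp(1) target: U' = 1
         for p in (-1.0, -3.0, -6.0, -9.0)]
```

The displacement grows without bound as the sample nears the boundary, no matter how small $$\epsilon$$ is chosen.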

## 5 Change of random variable

We consider another formulation that employs a transformation step in LMC. The derivation methodology is the opposite of the Itô transformation; we begin with a discretized algorithm and then consider the corresponding continuous-time SDE. We use this SDE representation to derive Theorem 5, which guarantees the sampling accuracy of the method without a rejection step. Moreover, Theorem 3 shows that this algorithm overcomes the instability issue of the Itô method.

Let $$f: {\mathbb{R}} \rightarrow {\mathbb{R}}_c$$ be a twice differentiable, monotonic function from an unbounded proxy variable $$\varphi \in {\mathbb{R}}$$ to a bounded target variable $$\theta \in {\mathbb{R}}_c$$,

\begin{aligned} \theta = f(\varphi ), \end{aligned}
(22)

then the target density $$\pi _\theta (\theta )$$ and the proxy density $$\pi (\varphi )$$ are known to have the following relation,

\begin{aligned} \pi (\varphi ) = \pi _\theta (\theta ) \left| f'(\varphi )\right| . \end{aligned}
(23)

For the proxy potential $$U(\varphi ) = - \log \pi (\varphi )$$ (up to an additive constant), the proxy gradient $$U'(\varphi )$$ is represented by the given target gradient $$U'_\theta (\theta )$$:

\begin{aligned} U'(\varphi ) = f'(\varphi ) U'_\theta (\theta ) - \frac{f''(\varphi )}{f'(\varphi )}. \end{aligned}
(24)

One can retain the computational gain by using the stochastic gradient $$\widehat{U}'_\theta$$ and construct the SGLD algorithm for the proxy variable:

\begin{aligned} \varphi _{t+1} = \varphi _t - \epsilon \left( f'(\varphi _t) \widehat{U}'_\theta (\theta _t) - \frac{f''(\varphi _t)}{f'(\varphi _t)} \right) + \sqrt{2\epsilon } \eta _t. \end{aligned}
(25)

We call this algorithm change-of-random-variable (CoRV) SGLD. CoRV SGLD forms a generalized class of samplers that contains the ordinary SGLD: we recover SGLD by using the identity transform $$f(\varphi ) = \varphi$$. CoRV SGLD has the following advantages.

• The algorithm maintains the efficient computational complexity of the stochastic gradient: Eq. (25) does not require iteration over the entire dataset.

• The samples are always in the target constrained space $${\mathbb{R}}_c$$. Equation (25) generates a proxy sample $$\varphi _t \in {\mathbb{R}}$$ and then Eq. (22) transforms it into a target sample $$\theta _t \in {\mathbb{R}}_c$$.

• Any transform function f can be employed in Eq. (25) as long as it is twice differentiable, monotonic, and $$\frac{f''(\varphi )}{f'(\varphi )}$$ exists. Many common functions satisfy this condition, such as the exponential, sigmoid, and softmax functions.
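Eq. (25) can be sketched in a few lines; the Beta(2, 2) target and the sigmoid transform below are our illustrative choices, as are all function names:

```python
import numpy as np

def corv_sgld(grad_U_theta, f, f_prime, f_ratio, phi0, eps, n_steps, seed=0):
    """CoRV SGLD, Eq. (25): the proxy phi diffuses on R with unit noise,
    and theta = f(phi) always lies in the constrained domain.  f_ratio
    must return f''(phi)/f'(phi)."""
    rng = np.random.default_rng(seed)
    phi = phi0
    thetas = np.empty(n_steps)
    for t in range(n_steps):
        theta = f(phi)
        drift = f_prime(phi) * grad_U_theta(theta) - f_ratio(phi)
        phi = phi - eps * drift + np.sqrt(2 * eps) * rng.standard_normal()
        thetas[t] = f(phi)
    return thetas

# Beta(2, 2) on (0, 1) through the sigmoid transform:
# U_theta = -log(theta) - log(1 - theta), and f''/f' = 1 - 2*sigmoid(phi).
sigmoid = lambda p: 1.0 / (1.0 + np.exp(-p))
samples = corv_sgld(
    grad_U_theta=lambda th: -1.0 / th + 1.0 / (1.0 - th),
    f=sigmoid,
    f_prime=lambda p: sigmoid(p) * (1.0 - sigmoid(p)),
    f_ratio=lambda p: 1.0 - 2.0 * sigmoid(p),
    phi0=0.0, eps=1e-2, n_steps=50000)
```

Note that only the $$\varphi$$-update carries noise; the transform itself is deterministic, so every sample lands strictly inside (0, 1).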

### 5.1 Stability

The following theorem explains the stability of CoRV SGLD by showing that the transformation does not cause an abrupt movement in the dynamics.

### Theorem 3

(Stability of CoRV) Let transform function f satisfy Assumption 2. Then, for the proxy gradient error $$\delta _\varphi$$ and any $$\theta \in S$$ approaching $$\partial S$$ from the inside, we have:

\begin{aligned} \lim _{\theta \rightarrow \partial S} \delta _\varphi = 0. \end{aligned}
(26)

### 5.2 Stationary distribution

We consider the following SDE of proxy variable $$\varphi$$ as the continuous counterpart of Eq. (25)

\begin{aligned} d\varphi (t) = - U'(\varphi (t)) dt + \sqrt{2}dW(t), \end{aligned}
(27)

so as to apply the tools of stochastic analysis. We confirm the existence and uniqueness of the weak solution and obtain its equilibrium.

Unlike the unconstrained case, a constrained target distribution $$\pi _\theta (\theta )$$ often has a nonzero density at a domain boundary. The following lemma ensures that the unnormalized proxy mass $$\int _{\varphi } \exp (-U(\varphi ))d\varphi$$ does not diverge.

### Lemma 1

(Proxy potential) Let f satisfy Assumption 2, and let the target pdf $$\pi _\theta (\theta )$$ have a finite limit as $$\theta$$ goes to the boundary. Then the proxy potential $$U(\varphi )$$ satisfies:

\begin{aligned} \lim _{\varphi \rightarrow \infty } U(\varphi ) = \infty . \end{aligned}
(28)

Lemma 1 is enough for some cases (e.g., the truncated normal). However, to show the same proposition for distributions that have an infinite density at a boundary (e.g., the beta and gamma distributions), we need the following additional assumption.

### Assumption 3

Let $$\pi _\theta$$ be the target probability distribution. Transform function f satisfies

\begin{aligned} \lim _{\varphi \rightarrow \infty } \pi _\theta (f(\varphi )) |f'(\varphi )| = 0. \end{aligned}
(29)

Under the existence and uniqueness of solution (see Sect. 6.2), we derive the stationary distribution of Eq. (27) as follows.

### Theorem 4

(Stationary distribution) Let transform function f satisfy Assumptions 2 and 3. For transition probability density functions $$p(\varphi ,t)$$ and $$p(\theta ,t)$$ of the variables at time t, we have:

\begin{aligned} \lim _{t\rightarrow \infty } p(\varphi ,t) = \pi (f(\varphi )) \left| f'(\varphi )\right| \end{aligned}
(30)

and

\begin{aligned} \lim _{t\rightarrow \infty } p(\theta ,t) = \pi _\theta (\theta ). \end{aligned}
(31)

### 5.3 Weak convergence

We also check that Eq. (25) does not break the unique weak solution of Eq. (27) by confirming that the discretization error is bounded. From Lemmas 1 and 5, the weak convergence is derived.

### Theorem 5

(Weak convergence) Let transform function f satisfy Assumption 2. For any continuously differentiable, polynomial-growth functions h and $$h_\theta$$, we have:

\begin{aligned} \left| {\mathbb{E}}\left[ h(\widetilde{\varphi }(T))\right] -{\mathbb{E}}\left[ h(\varphi (T))\right] \right| = {\mathcal{O}}(\epsilon _0) \end{aligned}
(32)

and

\begin{aligned} \left| {\mathbb{E}}\left[ h_\theta (\widetilde{\theta }(T))\right] -{\mathbb{E}}\left[ h_\theta (\theta (T))\right] \right| = {\mathcal{O}}(\epsilon _0), \end{aligned}
(33)

where $$\varphi (T)$$ and $$\theta (T)$$ denote the random variables at fixed time T, $$\widetilde{\varphi }(T)$$ and $$\widetilde{\theta }(T)$$ denote the samples at fixed time T discretized by CoRV SGLD, and $$\epsilon _0>0$$ is the initial step size.

Empirical result We empirically confirm Theorem 5 using basic distributions. The expectation of the continuous process, $${\mathbb{E}}[h_\theta (\theta (T))]$$, is substituted with its true value, and the identity function $$h_\theta (\theta ) = \theta$$ is selected. Specifically, we set $${\mathbb{E}}\left[ h_\theta (\theta (T))\right] = 0.25$$ for the gamma distribution with shape and scale both 0.5. Figure 2 shows the numerical errors corresponding to Eq. (33) for the three sampling methods. The error of CoRV scales almost linearly with step size $$\epsilon _0$$, as suggested by Theorem 5. The errors of mirror and Ito are significantly greater than those of CoRV, and smaller step sizes do not improve Ito, implying difficulty in practical applications.
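A minimal sketch of this check for the gamma target with shape and scale 0.5; the softplus transform onto $${\mathbb{R}}_+$$ and all variable names are our choices:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda p: 1.0 / (1.0 + np.exp(-p))
softplus = lambda p: np.log1p(np.exp(p))   # our transform choice: R -> R_+

# CoRV SGLD, Eq. (25), for the gamma target with shape 0.5 and scale 0.5:
# U_theta = 0.5*log(theta) + 2*theta, and f''/f' = 1 - sigmoid(phi).
eps, phi = 1e-3, 0.0
thetas = np.empty(200000)
for t in range(len(thetas)):
    theta = softplus(phi)
    grad_U = 0.5 / theta + 2.0
    s = sigmoid(phi)                       # f'(phi) for the softplus
    phi = phi - eps * (s * grad_U - (1.0 - s)) + np.sqrt(2 * eps) * rng.standard_normal()
    thetas[t] = softplus(phi)

est = thetas[50000:].mean()                # should approach E[theta] = 0.25
```

Despite the infinite target density at the boundary, the transformed gradient $$f'(\varphi )U'_\theta (\theta )$$ stays bounded as $$\theta \rightarrow 0$$, which is the stability behavior of Theorem 3.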

## 6 Proofs

### 6.1 Proof of Lemma 1

From the assumption on the target pdf, and since $$\varphi \rightarrow \infty$$ as $$\theta \rightarrow \partial S$$,

\begin{aligned} \lim _{\varphi \rightarrow \infty } \pi _\theta (f(\varphi )) < C, \end{aligned}
(34)

for constant $$C>0$$. From Lemma 4,

\begin{aligned} \lim _{\varphi \rightarrow \infty } \pi _\theta (f(\varphi )) |f'(\varphi )| = 0. \end{aligned}
(35)

Using

\begin{aligned} U(\varphi ) = - \log \left( \pi _\theta (\theta ) |f'(\varphi )| \right) , \end{aligned}
(36)

we have

\begin{aligned} \lim _{\varphi \rightarrow \infty } U(\varphi ) = \infty . \end{aligned}
(37)

$$\square$$

### 6.2 Solution existence and uniqueness of SDE (27)

We check the existence of the solution of the SDE (27). The following result is well-known.

### Lemma 2

(Solution existence) Let $$U'(\varphi )$$ be a continuous function of $$\varphi$$. Then the solution of the SDE (27) exists.

We also confirm the uniqueness of the solution. We employ the weak uniqueness for the uniqueness in the sense of a distribution law.

### Theorem 6

(Weak uniqueness (Stroock and Varadhan 1979)) Consider a d-dimensional SDE of $$X\in {\mathbb{R}}^d$$,

\begin{aligned} dX(t) = a(X(t)) dt + b(X(t)) dW(t). \end{aligned}
(38)

Let a(x) be a bounded measurable function for $$x\in {\mathbb{R}}^d$$. Let $$B(x) = b(x)^\intercal b(x)$$ be a bounded, continuous function where constant $$K>0$$ exists such that

\begin{aligned} \sum _{i,j=1}^{d} B_{ij}(x) \zeta _i \zeta _j \ge K |\zeta |^2, \end{aligned}
(39)

for $$\zeta = (\zeta _1,\cdots ,\zeta _d) \in {\mathbb{R}}^d$$. Then the uniqueness in the sense of a distribution law holds for the solution of the SDE (38).

The solution is unique under the following condition of $$U'(\varphi )$$.

### Lemma 3

(Solution uniqueness) Let the proxy potential gradient $$U'(\varphi )$$ be a bounded function. Then the solution of the SDE (27) is unique in the sense of a distribution law.

### Proof

From Theorem 6, the condition on the diffusion coefficient is straightforwardly confirmed: letting $$b = \sqrt{2}$$ and $$\zeta \in {\mathbb{R}}$$, there exists a constant $$K>0$$ such that

\begin{aligned} b^2 \zeta ^2 \ge K \zeta ^2. \end{aligned}
(40)

$$\square$$

### 6.3 Lemma 4

The following lemma is essential for characterizing the proxy potential and the solution of the proxy SDE (27).

### Lemma 4

(Limit of transform derivative) Under Assumption 2, we have

\begin{aligned} \lim _{\varphi \rightarrow \infty } f'(\varphi ) = 0. \end{aligned}
(41)

### Proof

Using L'Hôpital's rule,

\begin{aligned} \lim _{\varphi \rightarrow \infty } f(\varphi )&= \lim _{\varphi \rightarrow \infty } \frac{\exp (\varphi ) f(\varphi )}{\exp (\varphi )}\nonumber \\&= \lim _{\varphi \rightarrow \infty } \frac{\exp (\varphi ) (f(\varphi ) +f'(\varphi ))}{\exp (\varphi )}\nonumber \\&= \lim _{\varphi \rightarrow \infty } (f(\varphi )+f'(\varphi )). \end{aligned}
(42)

Thus we have

\begin{aligned} \lim _{\varphi \rightarrow \infty } f'(\varphi ) = 0. \end{aligned}
(43)

$$\square$$

### 6.4 Lemma 5

### Lemma 5

(Proxy gradient error) Let $$\delta$$ be the noise of the stochastic gradient of the target potential satisfying Assumption 1, and let f satisfy Assumption 2. Then, for the noise $$\delta _\varphi$$ of the stochastic gradient of the proxy potential

\begin{aligned} \widehat{U}'(\varphi ) = U'(\varphi ) + \delta _\varphi , \end{aligned}
(44)

we have:

\begin{aligned} {\mathbb{E}}\left[ \delta _\varphi \right] = 0, \ \ \ \ \ {\mathbb{E}}\left[ |\delta _\varphi |^l\right] < \infty , \end{aligned}
(45)

for some integer $$l \ge 2$$.

### Proof

Since $$\widehat{U}'_\theta (\theta )$$ satisfies Assumption 1,

\begin{aligned} \widehat{U}'_\theta (\theta ) = U'_\theta (\theta ) + \delta , \end{aligned}
(46)

as in Eq. (24), the stochastic gradient of the proxy potential is

\begin{aligned} \widehat{U}'(\varphi )&= f'(\varphi ) \left( U'_\theta (\theta ) +\delta \right) - \frac{f''(\varphi )}{f'(\varphi )}\nonumber \\&= U'(\varphi ) + \delta _\varphi , \end{aligned}
(47)

by letting $$\delta _\varphi = f'(\varphi ) \delta$$. Since Assumption 2 guarantees that the derivative of the transform is always finite, $$\delta _\varphi$$ also has zero mean and finite moments:

\begin{aligned} {\mathbb{E}}\left[ \delta _\varphi \right] = 0, \ \ \ \ \ {\mathbb{E}} \left[ |\delta _\varphi |^l\right] < \infty . \end{aligned}
(48)

$$\square$$

### Proof of Theorem 2

From Lemma 4 in Sect. 6.3,

\begin{aligned} \lim _{\varphi \rightarrow \infty } \frac{d}{d\varphi } g^{-1}(\varphi ) = 0. \end{aligned}
(49)

This implies that

\begin{aligned} \lim _{\theta \rightarrow \partial S} |g'(\theta )| = \lim _{\varphi \rightarrow \infty } \frac{1}{\left| \frac{d}{d\varphi } g^{-1}(\varphi )\right| } = \infty . \end{aligned}
(50)

From Eq. (19), the single-step difference is given by

\begin{aligned} |\varphi _{t+1} - \varphi _t| = \left| \epsilon _t \left( - g'(\theta _t)U'_\theta (\theta _t) + g''(\theta _t)\right) +\sqrt{2\epsilon _t} g'(\theta _t) \eta _t \right| . \end{aligned}
(51)

Since $$\eta _t\sim {\mathcal{N}}(0,1)$$ is almost surely nonzero, the diverging factor $$g'(\theta )$$ almost surely dominates this quantity. Therefore,

\begin{aligned} \lim _{\theta \rightarrow \partial S} |\varphi _{t+1} - \varphi _t| = \infty . \end{aligned}
(52)

$$\square$$

### Proof of Theorem 3

From Eq. (47) of Lemma 5 in Sect. 6.4, we have

\begin{aligned} \widehat{U}'(\varphi ) = U'(\varphi ) + \delta _\varphi , \end{aligned}
(53)

where $$\delta _\varphi = f'(\varphi ) \delta$$ and $$\delta$$ satisfies Assumption 1. From Lemma 4,

\begin{aligned} \lim _{\theta \rightarrow \partial S} \delta _\varphi = \lim _{\varphi \rightarrow \infty } f'(\varphi ) \delta = 0. \end{aligned}
(54)

$$\square$$

### Proof of Theorem 4

From Lemmas 2 and 3, there exists a unique solution in the sense of a distribution law.

From Lemmas 1, 5 and Assumption 3, the SDE (27) satisfies the same assumption that Sato and Nakagawa (2014) used for SGLD in unconstrained state space. The transition probability density function $$p(\varphi ,t)$$ follows the Fokker-Planck equation

\begin{aligned} \frac{\partial }{\partial t} p(\varphi ,t) = \frac{\partial }{\partial \varphi } \left( U'(\varphi ) p(\varphi ,t) \right) +\frac{\partial ^2}{\partial \varphi ^2} p(\varphi ,t), \end{aligned}
(55)

and its stationary distribution is

\begin{aligned} \lim _{t\rightarrow \infty } p(\varphi ,t) = \pi (\varphi ) \propto \exp (-U(\varphi )). \end{aligned}
(56)

Note that $$f'(\varphi (t))$$ is always finite from Assumption 2. Applying Eq. (23), we obtain the stationary distribution as

\begin{aligned} \lim _{t\rightarrow \infty } p(\theta ,t)|f'(\varphi )|&= \pi _\theta (\theta )|f'(\varphi )|\nonumber \\ \therefore \lim _{t\rightarrow \infty } p(\theta ,t)&= \pi _\theta (\theta ). \end{aligned}
(57)

$$\square$$

### Proof of Theorem 5

Let us consider stochastic differential equation

\begin{aligned} d\varphi (t) = a(\varphi (t))dt + b(\varphi (t))dW(t), \ \ \ 0 \le t \le T \end{aligned}
(58)

and its approximation in time $$t_{k-1} \le t \le t_k$$

\begin{aligned} d\widetilde{\varphi }(t) = \widetilde{a}(\varphi (t))dt +\widetilde{b}(\varphi (t))dW(t), \end{aligned}
(59)

where $$\widetilde{a}(\varphi (t)) = a(\varphi (t)) +\delta _{\varphi ,t}$$.
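For intuition only (a sketch, not part of the proof), the piecewise approximation in Eq. (59) corresponds to an Euler-Maruyama-type scheme; the function `euler_maruyama` and its arguments are illustrative names, not code from the paper:

```python
import numpy as np

def euler_maruyama(a, b, phi0, eps, n_steps, rng):
    """Euler-Maruyama discretization of dphi = a(phi) dt + b(phi) dW (Eq. 58).

    a, b:    drift and diffusion coefficients (callables)
    phi0:    initial state
    eps:     step size (held constant here for simplicity)
    n_steps: number of discretization steps
    """
    phi = phi0
    for _ in range(n_steps):
        # The Brownian increment dW over a step of length eps has
        # standard deviation sqrt(eps).
        phi = phi + eps * a(phi) + b(phi) * np.sqrt(eps) * rng.normal()
    return phi
```

With $$a(\varphi ) = -U'(\varphi )$$ and $$b \equiv \sqrt{2}$$ this reduces to the Langevin update of Eq. (1).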

Using Lemma 5 and Theorem 6 of Sato and Nakagawa (2014), for the test function h, we have

\begin{aligned} \left| {\mathbb{E}}\left[ h(\widetilde{\varphi }(T))\right] -{\mathbb{E}}\left[ h(\varphi (T))\right] \right|&= \left| \int _0^T {\mathbb{E}}\left[ \left( \widetilde{a}(\varphi (t)) -a(\varphi (t))\right) \frac{\partial }{\partial \varphi } {\mathbb{E}} \left[ h(\widetilde{\varphi }(t))\right] \right] dt\right. \nonumber \\&\quad \left. + \int _0^T \frac{1}{2} {\mathbb{E}} \left[ \left( \widetilde{b}(\varphi (t))^2-b(\varphi (t))^2\right) \frac{\partial ^2}{\partial \varphi ^2} {\mathbb{E}} \left[ h(\widetilde{\varphi }(t))\right] \right] dt \right| \end{aligned}
(60)

From the Weierstrass theorem, there exists a constant $$C_k>0$$ such that

\begin{aligned}&{\mathbb{E}}\left[ \left( \widetilde{a}(\varphi (t))-a(\varphi (t))\right) \frac{\partial }{\partial \varphi } {\mathbb{E}} \left[ h(\widetilde{\varphi }(t))\right] \right] \le C_k \epsilon _{t_{k-1}} \end{aligned}
(61)
\begin{aligned}&{\mathbb{E}}\left[ \left( \widetilde{b}(\varphi (t))^2 -b(\varphi (t))^2\right) \frac{\partial ^2}{\partial \varphi ^2} {\mathbb{E}} \left[ h(\widetilde{\varphi }(t))\right] \right] \le C_k \epsilon _{t_{k-1}} \end{aligned}
(62)

for time $$t_{k-1} \le t \le t_k$$. Letting $$C_{\rm{max}}$$ denote the maximum of $$C_k$$ over k and $$\epsilon _0$$ the maximum step size among the $$\epsilon _{t_{k-1}}$$,

\begin{aligned} \left| {\mathbb{E}}\left[ h(\widetilde{\varphi }(T))\right] -{\mathbb{E}}\left[ h(\varphi (T))\right] \right| < T C_{\rm{max}} \epsilon _0. \end{aligned}
(63)

That is, the sample of the proxy variable $$\varphi$$ generated by Eq. (25) converges weakly:

\begin{aligned} \left| {\mathbb{E}}\left[ h(\widetilde{\varphi }(T))\right] -{\mathbb{E}}\left[ h(\varphi (T))\right] \right| = {\mathcal{O}}(\epsilon _0). \end{aligned}
(64)

Let the test function h be a composition of the transform function f and a test function $$h_\theta$$ in the target domain: $$h(\cdot ) = h_\theta (f(\cdot ))$$. Then $$h(\varphi (T)) = h_\theta (\theta (T))$$ and $$h(\widetilde{\varphi }(T)) = h_\theta (\widetilde{\theta }(T))$$, so the sample of the target variable $$\theta$$ satisfies

\begin{aligned} |{\mathbb{E}}\left[ h(\widetilde{\varphi }(T))\right] - {\mathbb{E}} \left[ h(\varphi (T))\right] |&= {\mathcal{O}}(\epsilon _0)\nonumber \\ \therefore |{\mathbb{E}}\left[ h_\theta (\widetilde{\theta }(T))\right] -{\mathbb{E}}\left[ h_\theta (\theta (T))\right] |&= {\mathcal{O}}(\epsilon _0). \end{aligned}
(65)

$$\square$$

## 7 Experiments

In this section, we demonstrate the usefulness of our method on a range of models covering several application scenarios. The results establish the practical efficacy of the CoRV approach on top of the theoretical justifications discussed above. We used a P100 GPU accelerator for all experiments.

### 7.1 Bayesian NMF

For a typical application involving a probability distribution supported on a finite or semi-infinite interval, we considered Bayesian non-negative matrix factorization (Cemgil 2009). Given an observed $$I\times J$$ matrix X whose components take non-negative values, we approximated it with a low-rank matrix product WH, where W is an $$I\times R$$ non-negative matrix and H is an $$R\times J$$ non-negative matrix. The prior distribution and likelihood are

\begin{aligned}&W_{ir} \sim \text {Exponential}(\lambda _W), \ \ \ \ \ H_{rj} \sim \text {Exponential}(\lambda _H), \end{aligned}
(66)
\begin{aligned}&X_{ij} | W_{i:}, H_{:j} \sim \text {Poisson} \left( \sum _{r=1}^R W_{ir} H_{rj}\right) , \end{aligned}
(67)

where $$\lambda _W$$ and $$\lambda _H$$ are hyper-parameters.
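As an illustrative sketch (not from the paper's experiments), the generative model of Eqs. (66)-(67) can be simulated for a small hypothetical instance; the sizes and rate parameters below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem sizes and hyper-parameters.
I, J, R = 5, 4, 2
lam_W, lam_H = 1.0, 1.0

# Exponential priors on the factors, Eq. (66);
# NumPy parameterizes the exponential by scale = 1/rate.
W = rng.exponential(scale=1.0 / lam_W, size=(I, R))
H = rng.exponential(scale=1.0 / lam_H, size=(R, J))

# Poisson likelihood with rate matrix WH, Eq. (67).
X = rng.poisson(W @ H)
```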

SGLD uses the mirroring trick to generate samples, with the stochastic gradient evaluated on a mini-batch, as follows:

\begin{aligned} W_{i:}^* = \left| W_{i:} - \epsilon _t \widehat{U}'_{W_{i:}} + \sqrt{2\epsilon _t} \eta \right| , \end{aligned}
(68)

where the noise $$\eta$$ is drawn from $${\mathcal{N}}(0,I)$$, with I being the $$R\times R$$ identity matrix, and $$W_{i:}^*$$ denotes the sample at time $$t+1$$ given that $$W_{i:}$$ is the sample at time t. The absolute value is taken element-wise, which implements the mirroring trick. The stochastic gradient is

\begin{aligned} \widehat{U}'_{W_{i:}} = - \frac{N}{|S|}\sum _{X_k\in S} H_{: j_k} \left( \frac{X_{k}}{\widehat{X}_{k}} -1 \right) + \lambda _W, \end{aligned}
(69)

where $$j_k \in \{1,\cdots ,J\}$$ is the column index of the kth data point in mini-batch S, $$X_{k}$$ is the non-negative value of the kth data point, and $$\widehat{X}_{k} = \sum _{r=1}^R W_{i_k r} H_{r j_k}$$ is its estimate.
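The mirrored update of Eqs. (68)-(69) for a single row $$W_{i:}$$ can be sketched as follows; `mirrored_sgld_step`, its argument names, and the mini-batch layout are hypothetical choices, not code from the paper:

```python
import numpy as np

def mirrored_sgld_step(W_i, H, X_batch, j_idx, N, eps_t, lam_W, rng):
    """One mirrored SGLD update of a single row W_i (sketch of Eqs. 68-69).

    W_i:     (R,)  current row of W
    H:       (R, J) current factor matrix
    X_batch: (|S|,) observed values in the mini-batch for this row
    j_idx:   (|S|,) column indices of the mini-batch entries
    N:       total number of observations (rescales the mini-batch sum)
    """
    H_S = H[:, j_idx]                      # (R, |S|) columns hit by the batch
    X_hat = W_i @ H_S                      # (|S|,) Poisson rate estimates
    # Stochastic gradient of the potential, Eq. (69).
    grad = -(N / len(j_idx)) * (H_S * (X_batch / X_hat - 1.0)).sum(axis=1) + lam_W
    noise = rng.normal(size=W_i.shape)
    # Langevin step followed by the element-wise mirroring trick.
    return np.abs(W_i - eps_t * grad + np.sqrt(2.0 * eps_t) * noise)
```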

CoRV SGLD updates proxy variables by

\begin{aligned} \varphi _{W_{i:}}^* = \varphi _{W_{i:}} - \epsilon _t \left( f'(\varphi _{W_{i:}}) \widehat{U}'_{W_{i:}} -\frac{f''(\varphi _{W_{i:}})}{f'(\varphi _{W_{i:}})} \right) + \sqrt{2\epsilon _t} \eta . \end{aligned}
(70)

Here $$f'$$ and $$f''$$ are applied element-by-element. We draw the sample of $$\varphi _H$$ in the same manner. Note that Eq. (70) bypasses the mirroring trick because the proxy variables $$\varphi _W$$ and $$\varphi _H$$ range over the entire domain $${\mathbb{R}}$$; the matrices W and H are always non-negative via the element-wise transform $$f: {\mathbb{R}} \rightarrow {\mathbb{R}}_+$$. The algorithms are shown below.
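A minimal sketch of the proxy update in Eq. (70), assuming the softplus transform $$f(\varphi ) = \log (1+e^\varphi )$$, for which $$f'(\varphi ) = \sigma (\varphi )$$ (the sigmoid) and $$f''/f' = 1 - \sigma (\varphi )$$; the function names are hypothetical:

```python
import numpy as np

def softplus(phi):
    """f(phi) = log(1 + e^phi), an element-wise map from R to R_+."""
    return np.logaddexp(0.0, phi)

def corv_sgld_step(phi, grad_U_theta, eps_t, rng):
    """One CoRV SGLD update of the proxy variables (sketch of Eq. 70).

    phi:          proxy variables, unconstrained in R
    grad_U_theta: stochastic gradient of the potential in the target domain
    """
    sig = 1.0 / (1.0 + np.exp(-phi))           # f'(phi) = sigmoid
    drift = sig * grad_U_theta - (1.0 - sig)   # f' * U' - f''/f'
    noise = rng.normal(size=phi.shape)
    return phi - eps_t * drift + np.sqrt(2.0 * eps_t) * noise
```

The target sample is then recovered as `softplus(phi)`, which is non-negative by construction, so no mirroring is needed.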

Configuration In the experiments, we employed the MovieLens dataset, a commonly used benchmark for matrix factorization tasks (Ahn et al. 2015). The data matrix X consists of $$I = 71,567$$ users and $$J = 10,681$$ items with 10,000,054 non-zero entries in total. It was split 75 : 12.5 : 12.5 into training, validation, and test sets. We compared (1) CoRV, (2) the state-of-the-art SGLD-based method (Ahn et al. 2015) modified for non-negative values, and (3) SGRLD (Patterson and Teh 2013), which uses the natural gradient with diagonal preconditioning. Methods (2) and (3) used the mirroring trick. We evaluated each sampling method through its Bayesian prediction accuracy. We also compared three transform functions for constraining variables to be non-negative: exp, softplus, and ICLL in Table 3. The Itô formulation was omitted due to significant numerical instability. The test root mean square error (RMSE) was used as the performance metric; the prediction was given by the Bayesian predictive mean computed as a moving average. We set the number of latent dimensions R to 20 and 50, training for 10,000 iterations with $$R=20$$ and 20,000 with $$R=50$$. The step size was chosen by the Tree-structured Parzen Estimator (TPE) over 100 trials to minimize the validation loss. We set the hyper-parameters $$\lambda _W, \lambda _H$$ to 1.0 and the mini-batch size |S| to 10,000.

Results Figure 3 shows the RMSE curves as a function of iterations. SGLD and SGRLD are existing methods with the mirroring trick, whereas exp, softplus, and ICLL denote our method with the specified transform function. We observed that CoRV SGLD reached better predictions in fewer iterations than the other two algorithms. When $$R=20$$ (Fig. 3, left), SGLD took 10,000 iterations to reach an RMSE of 0.90, whereas CoRV SGLD (softplus and ICLL) achieved it in only 3,000 iterations. While the choice of transform function may influence performance, CoRV outperformed the best-performing baseline, SGRLD.

Computational Complexity CoRV SGLD requires the additional computation of the transformation step compared to vanilla SGLD (see Algorithm 2). In most cases, gradient computation dominates the SGLD calculation and is proportional to the number of data points in each mini-batch. The transformation step of CoRV SGLD depends only on the number of parameters and does not change the complexity of the gradient computation. Its influence on computation time was limited: the measured execution time increased by at most $$10\%$$.

### 7.2 Bayesian binary neural network

A binary neural network, whose parameters are restricted to binary values, is expected to achieve high performance on small devices owing to its memory efficiency (Courbariaux et al. 2015; Hubara et al. 2016). We evaluated each sampling method through the Bayesian prediction accuracy of a binary neural network model.

The weight parameter w was trained as a continuous variable and then binarized into $$\{-1, +1\}$$ by thresholding at 0 before inference. A transformed beta prior was defined on the constrained state space $$(-1,+1)$$,

\begin{aligned} p(w) = \frac{1}{2B(\alpha ,\beta )} \left( \frac{1}{2}w+\frac{1}{2} \right) ^{\alpha -1} \ \left( -\frac{1}{2}w+\frac{1}{2} \right) ^{\beta -1}, \end{aligned}
(71)

with hyper-parameters $$\alpha ,\beta <1$$ and the beta function B. This prior places high density near the boundary $$\{-1,+1\}$$, so a method that correctly samples near the boundary is expected to retain high accuracy after binarization.
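For concreteness (a sketch; the function names are not from the paper), the unnormalized log-density of Eq. (71), its gradient, and the thresholding step can be written as:

```python
import numpy as np

def log_beta_prior(w, alpha, beta):
    """Unnormalized log-density of the transformed beta prior, Eq. (71)."""
    return (alpha - 1.0) * np.log(0.5 * w + 0.5) + (beta - 1.0) * np.log(0.5 - 0.5 * w)

def grad_log_beta_prior(w, alpha, beta):
    """d/dw of the log-prior; for alpha, beta < 1 it pushes w toward {-1, +1}."""
    return (alpha - 1.0) / (w + 1.0) - (beta - 1.0) / (1.0 - w)

def binarize(w):
    """Threshold continuous weights at 0 into {-1, +1} before inference."""
    return np.where(w >= 0.0, 1.0, -1.0)
```

For $$\alpha = \beta < 1$$ the gradient vanishes at $$w = 0$$ and points toward the nearest boundary elsewhere, which is the boundary-seeking behavior described above.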

In the experiments, we considered a Bayesian binary three-layer feed-forward network with 50 hidden units and the ReLU activation. We employed the MNIST (Lecun et al. 1998) and Fashion-MNIST (Xiao et al. 2017) datasets, both split 8 : 1 : 1 into training, validation, and test sets. We compared (1) CoRV using the sigmoid, arctangent, and softsign transform functions, and (2) standard SGLD with the mirroring trick. The Itô formulation was omitted due to significant numerical instability. We evaluated the cross-entropy loss of the softmax classifier and the classification accuracy, with the accuracy given by the Bayesian predictive mean computed as a moving average of the binarized weights at each epoch. We trained the networks for 100 epochs on MNIST and 300 epochs on Fashion-MNIST. The step size was chosen by TPE over 100 trials to minimize the validation loss.

Results Figure 4 presents the test loss and accuracy. Note that this experiment aims to compare sampling methods on the same model rather than to propose a state-of-the-art network. The learning curves show that CoRV achieves better predictions than the mirroring heuristic. It is also useful in practice that the transformation enables stable computation with a large step size.

## 8 Conclusion

Because SGLD is designed for unbounded random variables, sampling bounded ones has traditionally resorted to heuristics. We demonstrated, both empirically and theoretically, that such heuristics may sacrifice sampling accuracy. To handle bounded random variables, we generalized SGLD using the change-of-random-variable (CoRV) formulation and analyzed its weak convergence. Empirical evaluations showed that our CoRV SGLD outperformed existing heuristic alternatives on Bayesian non-negative matrix factorization and Bayesian binary neural networks.