Abstract
Stochastic gradient Langevin dynamics (SGLD) is a computationally efficient sampler for Bayesian posterior inference given a large-scale dataset and a complex model. Although SGLD is designed for unbounded random variables, practical models often incorporate variables within a bounded domain, such as nonnegative values or values in a finite interval. The use of a variable transformation is a typical way to handle such bounded variables. This paper reveals, from both theoretical and empirical perspectives, that several mapping approaches commonly used in the literature produce erroneous samples. We show that changing the random variable in the discretization using an invertible Lipschitz mapping function overcomes this pitfall and attains weak convergence, while the other methods are numerically unstable or cannot be justified theoretically. Experiments demonstrate its efficacy for widely used models with bounded latent variables, including Bayesian nonnegative matrix factorization and binary neural networks.
1 Introduction
Sampling a random variable from a given target distribution is a key problem in Bayesian inference. The Langevin Monte Carlo (LMC) algorithm has attracted attention for its high efficiency and scalability on large datasets. Whereas sampling methods in this category are usually designed to handle unbounded random variables, the target variable is often limited to some bounded space in practical problems. In such cases, the common practice is to match the domain of the variable with a transformation. For example, when a target variable must be nonnegative, the exponential function is frequently adopted as the transformation.
This paper discusses the problem of drawing samples from a constrained target distribution via a transform of unconstrained samples generated by the LMC algorithm. More precisely, let \(\theta \sim \pi _\theta (\theta )\) be the target random variable in a constrained state space \({\mathbb{R}}_c\), e.g., a (semi-)finite interval in \({\mathbb{R}}\), and \(\varphi \sim \pi (\varphi )\) be a proxy random variable defined on the whole real line \({\mathbb{R}}\). Although we are interested in sampling from \(\pi _\theta (\theta )\), LMC is unsuitable for directly handling such variables because its diffusion is prone to overstep the boundary. Thus, we consider the following two-step LMC algorithm:
where f is a transform function that maps the proxy to the target domain and \(\widehat{\nabla }\) denotes an unbiased stochastic gradient operator. The stochastic gradient corresponds to approximating the likelihood with subsamples in Bayesian sampling and improves computational efficiency on a large dataset.
The following three possible approaches conforming to Eqs. (1) and (2) are discussed in this paper.

Mirroring trick (Sect. 3): the heuristic employed in Patterson and Teh (2013), which simply matches the domain, e.g., \(\theta = |\varphi |\) for nonnegative \(\theta\), assuming the gradients are unchanged: \(\widehat{\nabla }_\varphi \log \pi (\varphi ) = \widehat{\nabla }_\theta \log \pi _\theta (\theta )\).

Itô formula (Sect. 4): application of the transformation f directly to the stochastic differential equation (SDE) that underlies Eqs. (1) and (2). This obtains \(\widehat{\nabla }_\varphi \log \pi (\varphi )\) by the Itô formula (Itô 1944), the chain rule of stochastic calculus.

Change of random variable (CoRV) (Sect. 5): transformation f applied to the random variable, obtaining \(\widehat{\nabla }_\varphi \log \pi (\varphi )\) via the Jacobian relation \(\pi (\varphi ) = \pi _\theta (\theta )f'(\varphi )\).
Table 1 compares these approaches. Although the mirroring trick has been studied previously and the Itô formula is one of the most fundamental tools in stochastic analysis, many implementations and standard sampling software employ a method based on the CoRV approach without theoretical investigation. In this paper, we analyze these methods and clarify their properties. The theoretical results show that the CoRV approach has a pitfall: some transform functions amplify the gradient noise and break the weak (distribution-law) convergence of the Euler–Maruyama discretization scheme. We show that Lipschitz continuity of the transform function suppresses the noise and recovers the weak convergence. We also confirm that such a transformation makes the CoRV algorithm stable near a domain boundary after transformation. Numerical experiments on Bayesian nonnegative matrix factorization and binary neural networks support the theory.
Technically, we also confirm the following propositions.

The Itô formula approach almost surely diverges around a domain boundary, regardless of the target distribution and transform function (Theorem 2).

The CoRV approach with Lipschitz transform functions has a stationary distribution (Theorem 4) and converges weakly (Theorem 5) in the same order as SGLD on \({\mathbb{R}}\).
Note that our discussion follows the common setting of stochastic gradient MCMC as in previous studies (Welling and Teh 2011; Sato and Nakagawa 2014; Teh et al. 2016). That is, we adopt stochastic gradients with minibatches and omit the Metropolis–Hastings rejection step to avoid performance overhead on massive datasets. Thus, we guarantee the sampling accuracy by discretization analysis of the SDE instead of relying on the detailed balance of a Markov chain.
2 Review: stochastic gradient MCMC
This section outlines the original stochastic gradient MCMC algorithm designed for unconstrained state spaces. Our notation uses a one-dimensional parameter for simplicity. The extension to multidimensional cases is straightforward.
2.1 Langevin Monte Carlo
The Langevin Monte Carlo (LMC) algorithm is an efficient sampler for unbounded variables. Let us consider a target distribution \(\pi _\theta (\theta )\) and its potential \(U_\theta (\theta ) = -\log \pi _\theta (\theta )\). We put the subscript \(_\theta\) to emphasize that the distribution and potential represent the constrained random variable \(\theta\). The subscript is omitted when we discuss unconstrained random variables. The LMC algorithm is
$$\theta _{t+1} = \theta _t - \epsilon _t U'_\theta (\theta _t) + \sqrt{2\epsilon _t}\, \eta _t, \quad \eta _t \sim {\mathcal{N}}(0,1), \qquad \text{(3)}$$
where \(U'_\theta (\theta ) = \frac{d}{d\theta } U_\theta (\theta )\), \({\mathcal{N}}(0,1)\) is the standard Gaussian distribution, and \(\epsilon _t>0\) is the step size.
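As a minimal, self-contained sketch of the update (3), the following samples a standard normal target (an illustrative choice, not a model from the paper), for which \(U(\theta ) = \theta ^2/2\) and \(U'(\theta ) = \theta\):

```python
import numpy as np

def lmc(grad_U, theta0, eps, n_steps, rng):
    """Langevin Monte Carlo, Eq. (3): theta <- theta - eps*U'(theta) + sqrt(2*eps)*eta."""
    theta, samples = theta0, np.empty(n_steps)
    for t in range(n_steps):
        theta = theta - eps * grad_U(theta) + np.sqrt(2 * eps) * rng.standard_normal()
        samples[t] = theta
    return samples

# Standard normal target: U(theta) = theta^2 / 2, so U'(theta) = theta.
rng = np.random.default_rng(0)
samples = lmc(lambda th: th, theta0=0.0, eps=0.01, n_steps=50_000, rng=rng)
```

After discarding a burn-in, the empirical mean and variance of `samples` approach those of \({\mathcal{N}}(0,1)\), up to the \(O(\epsilon )\) discretization error discussed below.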
As step size \(\epsilon _t \rightarrow 0\), the variable \(\theta _t\) becomes an Itô process described by an SDE
$$d\theta (t) = -U'_\theta (\theta (t))\, dt + \sqrt{2}\, dW(t), \qquad \text{(4)}$$
with the Wiener process W(t). In other words, the LMC algorithm (3) can be seen as a discretization of the SDE (4), called the Langevin equation. From the Fokker–Planck equation, the Itô process \(\theta (t)\) is known to have the stationary distribution (Gardiner 2009)
$$\pi _\theta (\theta ) \propto \exp (-U_\theta (\theta )). \qquad \text{(5)}$$
Note that the LMC algorithm is capable of generating samples from the unnormalized form of the density function \(\exp (-U_\theta (\theta ))\). This property is favorable for handling Bayesian posteriors \(p(\theta \mid x)=\frac{p(x\mid \theta )p(\theta )}{\int p(x\mid \theta )p(\theta ) d\theta }\), where the normalizing evidence term is often intractable.
2.2 Stochastic gradient Langevin dynamics
In Bayesian learning, the LMC algorithm is not scalable with respect to the number of data points. Let us consider that the target distribution is a Bayesian posterior of \(\theta\) given observation \(x = \{x_1, \dots , x_n \}\) with a prior \(p(\theta )\),
The potential \(U_\theta\) consists of the logarithm of the prior and the likelihood over the full data points. LMC needs to calculate the derivative of the potential
$$U'_\theta (\theta ) = -\frac{d}{d\theta }\Big( \log p(\theta ) + \sum _{i=1}^{n} \log p(x_i \mid \theta ) \Big)$$
at each iteration, which is computationally intensive as the number of data points n grows. This is critical in recent machine learning applications that use big data.
To keep the time complexity constant, one can apply a stochastic approximation. Instead of the exact gradient, the stochastic gradient is computed via a minibatch \(S_t\), only a subset of the full dataset,
$$\widehat{U}'_\theta (\theta ) = -\frac{d}{d\theta }\Big( \log p(\theta ) + \frac{n}{|S_t|} \sum _{i \in S_t} \log p(x_i \mid \theta ) \Big).$$
It is a noisy approximation of \(U'_\theta (\theta )\). Replacing the exact gradient of LMC with the stochastic gradient, the stochastic gradient Langevin dynamics (SGLD) algorithm generates samples as follows:
$$\theta _{t+1} = \theta _t - \epsilon _t \widehat{U}'_\theta (\theta _t) + \sqrt{2\epsilon _t}\, \eta _t, \quad \eta _t \sim {\mathcal{N}}(0,1).$$
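The minibatch rescaling by \(n/|S_t|\) can be sketched as follows for a toy posterior of a Gaussian mean (an illustrative model chosen here, not one of the paper's experiments), where \(x_i \sim {\mathcal{N}}(\theta , 1)\) with prior \(\theta \sim {\mathcal{N}}(0, 1)\):

```python
import numpy as np

def sgld_step(theta, x, batch_idx, eps, grad_log_prior, grad_log_lik, rng):
    """One SGLD step: the minibatch gradient is rescaled by n/|S_t| so that
    it is an unbiased estimate of the full-data potential gradient."""
    n, m = len(x), len(batch_idx)
    grad_U = -grad_log_prior(theta) - (n / m) * sum(grad_log_lik(x[i], theta)
                                                    for i in batch_idx)
    return theta - eps * grad_U + np.sqrt(2 * eps) * rng.standard_normal()

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=1000)
theta, chain = 0.0, []
for t in range(20_000):
    batch = rng.choice(len(x), size=50, replace=False)
    theta = sgld_step(theta, x, batch, eps=1e-4,
                      grad_log_prior=lambda th: -th,        # d/dtheta log N(theta; 0, 1)
                      grad_log_lik=lambda xi, th: xi - th,  # d/dtheta log N(xi; theta, 1)
                      rng=rng)
    chain.append(theta)
```

After burn-in, the chain fluctuates around the posterior mean \(n\bar{x}/(n+1)\); the gradient noise from the minibatch slightly inflates the stationary variance, which is exactly the error \(\delta\) formalized in Assumption 1 below.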
SGLD also enjoys a computational gain by omitting the Metropolis–Hastings rejection step, which ordinary MCMC methods usually run to ensure detailed balance.
Due to the approximation of the gradient and the exclusion of the rejection step, SGLD may not satisfy the detailed balance of a Markov chain. Instead, weak convergence with regard to the SDE (4) has been discussed in the literature (Sato and Nakagawa 2014; Teh et al. 2016). Let the stochastic gradient satisfy the following assumption.
Assumption 1
(Gradient error) The stochastic gradient \(\widehat{U}'_\theta (\theta )\) is written using the exact gradient \(U'_\theta (\theta )\) and the error \(\delta\) as
$$\widehat{U}'_\theta (\theta ) = U'_\theta (\theta ) + \delta ,$$
where \(\delta\) is white noise or the Wiener process of zero mean and finite variance satisfying
for some integer \(l \ge 2\).
Then the following theorem holds for the sample sequence \(\{\theta _t\}_{t=1}^T\). In short, weak convergence states that the discretization error of SGLD vanishes in expectation at any fixed time as the time increment approaches zero.
Definition 1
(Weak convergence (Iacus 2008)) Let \(Y_\epsilon\) be a time-discretized approximation of a continuous-time process Y and \(\epsilon _0\) be the maximum time increment of the discretization. \(Y_\epsilon\) is said to converge weakly to Y if, for any fixed time T, any continuously differentiable function h of polynomial growth, and any constant \(\epsilon _0>0\), it holds that
$$\left| {\mathbb{E}}[h(Y(T))] - {\mathbb{E}}[h(Y_\epsilon (T))] \right| \le C \epsilon _0$$
for some constant \(C>0\).
2.3 Other related work
Brosse et al. (2017) developed another line of research on an LMC algorithm for a random variable on a convex body. They employed proximal MCMC (Pereyra 2016; Durmus et al. 2018) with the Moreau–Yosida envelope, which finds a well-behaved regularization of the target density on a convex body that preserves convexity and Lipschitzness. The sampling distribution is, nevertheless, an unbounded approximation of the target distribution, and it still draws samples from outside the domain. The limitation of log-concavity and the computing cost of the proximal operator at each sample prevent its application to large datasets as well as complex models such as neural networks.
3 Mirroring trick
Although many studies have been carried out for LMC and SGLD defined on the real space \({\mathbb{R}}\), theoretical analysis in the constrained space \({\mathbb{R}}_c\) remains unresolved. The difficulty comes from the fact that the LMC algorithm is a discretization of an Itô process whose equilibrium is a target distribution on \({\mathbb{R}}\). This is problematic in the many applications where we handle latent random variables in a bounded domain, such as latent Dirichlet allocation (Blei et al. 2003), where \(\theta\) lies in a probability simplex, nonnegative matrix factorization (Cemgil 2009), with all elements of \(\theta\) being nonnegative, and binary neural networks (Courbariaux et al. 2015; Hubara et al. 2016), with \(\theta \in (-1, 1)\).
The mirroring trick is a straightforward heuristic to cope with this problem. It sends outgoing samples back inside at the domain boundaries so as not to overstep the constraint. Patterson and Teh (2013) employed it to sample from a Gamma distribution defined on \({\mathbb{R}}_+\), merely taking the absolute value of the generated sample. There is no convergence guarantee for this trick, because it assumes that \(\widehat{\nabla } \log \pi (\varphi ) = \widehat{\nabla } \log \pi _\theta (\theta )\) and that transformation f does not change the equilibrium. The heuristic is partially justified by Bubeck et al. (2015) and Bubeck et al. (2018). They assumed accurate gradients and extended the LMC algorithm to an SDE with a reflecting boundary condition. Their stochastic process defined on a convex body, called reflected Brownian motion, is discretized into an LMC algorithm accompanied by the mirroring trick. This interpretation aids its theoretical investigation. However, Bubeck et al. (2018) also stated that the extension of their result to SGLD with stochastic gradients is an open problem for future work.
Our preliminary experiments show that the mirroring trick empirically suffers from inaccurate sampling near the boundaries. Figure 1 (the curves labeled mirror) indicates that the mirroring trick fails to capture the distribution especially when the density is sparse, i.e., concentrated near the boundaries. The sampling can thus be inaccurate when the model uses a sparse prior, which is often employed to avoid overfitting. This disadvantage forces us to set a tiny step size for accurate sampling near the boundary, which results in substantial performance degradation in the experiments in Sect. 7.
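To make the heuristic concrete, here is a minimal sketch of LMC with the mirroring trick for a half-normal target on \({\mathbb{R}}_+\) (an illustrative, benign case chosen because its mean \(\sqrt{2/\pi } \approx 0.798\) is known in closed form; the point above is that accuracy degrades when the density concentrates at the boundary):

```python
import numpy as np

def mirrored_lmc(grad_U, theta0, eps, n_steps, rng):
    """LMC on R_+ with the mirroring trick: reflect overstepping samples
    back into the domain by taking the absolute value."""
    theta, samples = theta0, np.empty(n_steps)
    for t in range(n_steps):
        theta = abs(theta - eps * grad_U(theta) + np.sqrt(2 * eps) * rng.standard_normal())
        samples[t] = theta
    return samples

# Half-normal target: pi(theta) ∝ exp(-theta^2 / 2) on R_+, so U'(theta) = theta.
rng = np.random.default_rng(2)
samples = mirrored_lmc(lambda th: th, theta0=1.0, eps=0.01, n_steps=50_000, rng=rng)
```

For this symmetric, boundary-benign target the reflected chain tracks the half-normal mean well; the failure modes discussed above arise for sparse densities such as a Gamma with shape below one.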
4 Itô formula
Here we consider the following two-step derivation: first, we use the Itô formula to construct the SDE in the unconstrained domain with the corresponding transform function. Then, we discretize the SDE to obtain the desired algorithm. While this derivation is straightforward and theoretically appealing, we show below that the resulting algorithm suffers from instability near the boundary.
In the stochastic differential equation, we have the following formula.
Theorem 1
(Itô Formula (Itô 1944)) Assume X(t) is an Itô process satisfying the stochastic differential equation
$$dX(t) = a(X(t))\, dt + b(X(t))\, dW(t).$$
Let h(t, X(t)) be a given bounded function in \(C^2((0,\infty )\times {\mathbb{R}})\). Then, h(t, X(t)) satisfies the stochastic differential equation
$$dh(t, X(t)) = {\mathcal{L}}_1 h(t, X(t))\, dt + {\mathcal{L}}_2 h(t, X(t))\, dW(t),$$
where \({\mathcal{L}}_1\) and \({\mathcal{L}}_2\) are linear operators defined by
$${\mathcal{L}}_1 = \frac{\partial }{\partial t} + a(x) \frac{\partial }{\partial x} + \frac{b(x)^2}{2} \frac{\partial ^2}{\partial x^2}, \qquad {\mathcal{L}}_2 = b(x) \frac{\partial }{\partial x}.$$
We begin by transforming the following Itô process of \(\theta (t)\),
$$d\theta (t) = a(\theta (t))\, dt + b\, dW(t).$$
Let \(g:{\mathbb{R}}_c \rightarrow {\mathbb{R}}\) be a smooth invertible function from a bounded target variable \(\theta \in {\mathbb{R}}_c\) to an unbounded proxy variable \(\varphi \in {\mathbb{R}}\). The constrained state space \({\mathbb{R}}_c\) can be a finite or semi-infinite interval of \({\mathbb{R}}\) (see Definition 2 and Table 2 for details). We consider a new stochastic process \(\varphi (t)\) defined by
$$\varphi (t) = g(\theta (t)).$$
From the Itô formula (Theorem 1), \(\varphi (t)\) is also an Itô process of
$$d\varphi (t) = \Big( a(\theta (t))\, g'(\theta (t)) + \frac{b^2}{2} g''(\theta (t)) \Big) dt + b\, g'(\theta (t))\, dW(t).$$
Letting \(a(\theta ) = -U'_\theta (\theta )\) and \(b = \sqrt{2}\), discretizing the process results in the following LMC:
$$\varphi _{t+1} = \varphi _t + \epsilon _t \big( -U'_\theta (\theta _t)\, g'(\theta _t) + g''(\theta _t) \big) + \sqrt{2\epsilon _t}\, g'(\theta _t)\, \eta _t, \quad \theta _t = g^{-1}(\varphi _t). \qquad \text{(19)}$$
While a general connection between SDE and LMC is discussed by Ma et al. (2015), this algorithm is distinct in that the transform step \(\theta = g^{-1}(\varphi )\) is employed to keep samples in the target domain.
Unfortunately, Eq. (19) is likely to draw inaccurate samples whether or not the gradient has noise. Figure 1 demonstrates that this method (labeled Ito) fails to track the target density. We attribute this phenomenon to an intrinsic instability around the boundary, regardless of the target potential and the transform function.
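The mechanism can be sketched numerically. Taking \(f = g^{-1}\) to be the softplus map onto \({\mathbb{R}}_+\) (one concrete choice) and a flat potential \(U'_\theta = 0\) (an assumption made only to isolate the boundary effect), the coefficients \(g'\) and \(g''\) in the Eq. (19)-style update blow up as \(\theta \rightarrow 0\):

```python
import numpy as np

# g = f^{-1} for f = softplus: g(theta) = log(e^theta - 1), so
# g'(theta) = e^theta / (e^theta - 1),  g''(theta) = -e^theta / (e^theta - 1)^2.
def ito_step(theta, eps, grad_U, eta):
    phi = np.log(np.expm1(theta))                    # phi = g(theta)
    gp = np.exp(theta) / np.expm1(theta)             # g'(theta), diverges as theta -> 0
    gpp = -np.exp(theta) / np.expm1(theta) ** 2      # g''(theta)
    # Euler-Maruyama step of d(phi) = (a*g' + g'') dt + sqrt(2)*g' dW, a = -U'(theta)
    return phi + eps * (-grad_U(theta) * gp + gpp) + np.sqrt(2 * eps) * gp * eta

# Single-step move in phi near the boundary theta -> 0, flat potential, unit noise draw:
inc = [abs(ito_step(th, 0.01, lambda t: 0.0, 1.0) - np.log(np.expm1(th)))
       for th in (1e-1, 1e-3, 1e-5)]
```

The increments grow without bound as the starting point approaches the boundary, which is the behavior Theorem 2 below formalizes.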
To theoretically discuss this instability, we first assume the following class of transform functions.
Assumption 2
(Transform function) Let f be a Lipschitz and monotonically increasing function. Namely, there exists a constant \(L>0\) such that, for all \(\varphi \in {\mathbb{R}}\),
$$0 < f'(\varphi ) \le L.$$
The boundary value of the target domain, denoted by \(\partial S\), corresponds to infinity in the proxy space: \(\lim _{\varphi \rightarrow \pm \infty } f(\varphi ) = \partial S\), and \(\lim _{\varphi \rightarrow \pm \infty } f'(\varphi )\) exists.
All functions except the exponential in Table 3 satisfy this assumption.
Depending on the constraints in the target domain, f may be a decreasing or an upper- and lower-bounded function. We continue with Assumption 2 for simplicity, though our discussion is also applicable to the more general d-dimensional case:
Definition 2
(Constrained state space) The constrained state space \({\mathbb{R}}_c^d\) is a d-dimensional interval defined as a subset of \({\mathbb{R}}^d\) that is the Cartesian product of d (semi-)finite intervals, \(I=I_{1}\times I_{2}\times \cdots \times I_{d}\), one on each coordinate axis.
The instability of the algorithm is shown in the following theorem.
Theorem 2
(Instability of the Itô transformation) Let \(f=g^{-1}: {\mathbb{R}}\rightarrow {\mathbb{R}}_c\) satisfy Assumption 2. Then for any \(\epsilon >0\), any \(U'_\theta (\theta )\), and any \(\theta \in S\) approaching \(\partial S\) from the inside, the single-step difference of the Itô transformation method diverges almost surely:
Please refer to Sect. 6 for all the proofs in this paper.
This suggests that the step size must be made extremely small to cope with the instability, which would make the sampling substantially slower to mix.
5 Change of random variable
We consider another formulation to employ a transformation step in LMC. The derivation methodology is the opposite of the Itô transformation; we begin with a discretized algorithm and then consider the corresponding continuoustime SDE. We use this SDE representation to derive Theorem 5, which guarantees the sampling accuracy of the method without a rejection step. Also, this algorithm overcomes the instability issue by Theorem 3, unlike the former Itô method.
Let function \(f: {\mathbb{R}} \rightarrow {\mathbb{R}}_c\) be a twice-differentiable monotonic function from an unbounded proxy variable \(\varphi \in {\mathbb{R}}\) to a bounded target variable \(\theta \in {\mathbb{R}}_c\),
$$\theta = f(\varphi ). \qquad \text{(22)}$$
Then the target density \(\pi _\theta (\theta )\) and the proxy density \(\pi (\varphi )\) are known to have the following relation:
$$\pi (\varphi ) = \pi _\theta (f(\varphi ))\, f'(\varphi ). \qquad \text{(23)}$$
For the proxy potential \(U(\varphi ) = -\log \pi (\varphi )\), the proxy gradient \(U'(\varphi )\) is represented by the given target \(U'_\theta (\theta )\):
$$U'(\varphi ) = U'_\theta (f(\varphi ))\, f'(\varphi ) - \frac{f''(\varphi )}{f'(\varphi )}. \qquad \text{(24)}$$
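To make the step behind this relation explicit, it follows by differentiating the negative log of the density relation (a routine change-of-variables calculation, assuming \(f' > 0\)):

```latex
U(\varphi) = -\log \pi(\varphi)
           = -\log \pi_\theta(f(\varphi)) - \log f'(\varphi)
           = U_\theta(f(\varphi)) - \log f'(\varphi),
\qquad\text{hence}\qquad
U'(\varphi) = U'_\theta(f(\varphi))\, f'(\varphi) - \frac{f''(\varphi)}{f'(\varphi)}.
```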
One can enjoy the computational gain by using the stochastic gradient \(\widehat{U}'_\theta\), and construct the SGLD algorithm for the proxy variable:
$$\varphi _{t+1} = \varphi _t - \epsilon _t \Big( \widehat{U}'_\theta (\theta _t)\, f'(\varphi _t) - \frac{f''(\varphi _t)}{f'(\varphi _t)} \Big) + \sqrt{2\epsilon _t}\, \eta _t, \quad \theta _t = f(\varphi _t). \qquad \text{(25)}$$
We call this algorithm change-of-random-variable (CoRV) SGLD. CoRV SGLD forms a generalized class of samplers that contains ordinary SGLD. Indeed, we recover SGLD by using the identity transform \(f(\varphi ) = \varphi\). CoRV SGLD has the following advantages.

The algorithm maintains the efficient computational complexity of the stochastic gradient.
Equation (25) does not require iteration over the entire dataset.

The samples are always in the target constrained space \({\mathbb{R}}_c\). Equation (25) generates a proxy sample \(\varphi _t \in {\mathbb{R}}\) and then Eq. (22) transforms it into a target sample \(\theta _t \in {\mathbb{R}}_c\).

Any transform function f can be employed in Eq. (25) if it is twice differentiable, monotonic, and \(\frac{f''(\varphi )}{f'(\varphi )}\) exists. Many common functions satisfy this condition, such as the exponential, sigmoid, and softmax functions.
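As a concrete sketch, the following draws from the Gamma(0.5, 0.5) target used in Sect. 5.3 with the softplus transform (here with exact rather than minibatch gradients, and with an illustrative step size; both are choices made for this sketch, not prescribed by the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def corv_sgld(grad_U_theta, phi0, eps, n_steps, rng):
    """CoRV SGLD, Eq. (25)-style, with f = softplus: R -> R_+.
    Here f'(phi) = sigmoid(phi) and f''(phi)/f'(phi) = sigmoid(-phi)."""
    phi, thetas = phi0, np.empty(n_steps)
    for t in range(n_steps):
        theta = np.logaddexp(0.0, phi)               # theta = f(phi) = log(1 + e^phi)
        grad_U_phi = grad_U_theta(theta) * sigmoid(phi) - sigmoid(-phi)
        phi = phi - eps * grad_U_phi + np.sqrt(2 * eps) * rng.standard_normal()
        thetas[t] = np.logaddexp(0.0, phi)
    return thetas

# Gamma(shape=0.5, scale=0.5) target (mean 0.25, density unbounded at 0):
# U_theta(theta) = 0.5*log(theta) + 2*theta, hence U'_theta(theta) = 0.5/theta + 2.
rng = np.random.default_rng(0)
thetas = corv_sgld(lambda th: 0.5 / th + 2.0, phi0=0.0, eps=0.005,
                   n_steps=400_000, rng=rng)
```

Note that although \(U'_\theta (\theta )\) diverges as \(\theta \rightarrow 0\), the proxy drift \(U'_\theta (\theta ) f'(\varphi ) - f''(\varphi )/f'(\varphi )\) stays bounded here, which is the stability property formalized in Sect. 5.1.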
5.1 Stability
The following theorem explains the stability of CoRV SGLD by showing that the transformation does not cause an abrupt movement in the dynamics.
Theorem 3
(Stability of CoRV) Let transform function f satisfy Assumption 2. Then for a gradient error \(\delta _\varphi\) and for any \(\theta \in S\) approaching \(\partial S\) from the inside, we have:
5.2 Stationary distribution
We consider the following SDE of the proxy variable \(\varphi\) as the continuous counterpart of Eq. (25),
$$d\varphi (t) = -U'(\varphi (t))\, dt + \sqrt{2}\, dW(t), \qquad \text{(27)}$$
so as to apply the tools of stochastic analysis. We confirm the existence and uniqueness of the weak solution and obtain its equilibrium.
Unlike the unconstrained case, a constrained target distribution \(\pi _\theta (\theta )\) often has a nonzero density at a domain boundary. The following lemma is required so that the unnormalized proxy distribution \(\int _{\varphi } \exp (-U(\varphi ))d\varphi\) does not diverge.
Lemma 1
(Proxy potential) Let f satisfy Assumption 2, and a target pdf \(\pi _\theta (\theta )\) have a finite limit as \(\theta\) goes to the boundary. Then for any \(U(\varphi )\), we have:
Lemma 1 is enough for some cases (e.g., the truncated normal). However, in order to show the same proposition for distributions that have an infinite density at a boundary (e.g., the beta and gamma distributions), we need the following additional assumption.
Assumption 3
Let \(\pi _\theta\) be the target probability distribution. Transform function f satisfies
Under the existence and uniqueness of the solution (see Sect. 6.2), we derive the stationary distribution of Eq. (27) as follows.
Theorem 4
(Stationary distribution) Let transform function f satisfy Assumptions 2 and 3. For transition probability density functions \(p(\varphi ,t)\) and \(p(\theta ,t)\) of the variables at time t, we have:
and
5.3 Weak convergence
We also check that Eq. (25) does not break the unique weak solution of Eq. (27) by confirming that the discretization error is bounded. From Lemmas 1 and 5, the weak convergence is derived.
Theorem 5
(Weak convergence) Let transform function f satisfy Assumption 2.
For any continuously differentiable and polynomial growth functions h and \(h_\theta\), we have:
and
where \(\varphi (T)\) and \(\theta (T)\) denote the random variables at fixed time T, \(\widetilde{\varphi }(T)\) and \(\widetilde{\theta }(T)\) denote the discretized samples at fixed time T generated by CoRV SGLD, and \(\epsilon _0>0\) is the initial step size.
Empirical result We empirically confirm Theorem 5 using basic distributions. The expectation of the continuous process \(h_\theta (\theta (T))\) was substituted with its true expectation, and the identity function \(h_\theta (\theta ) = \theta\) was selected. Specifically, we set \({\mathbb{E}}\left[ h_\theta (\theta (T))\right] = 0.25\) for the gamma distribution with its shape and scale being 0.5. Figure 2 shows the numerical errors corresponding to Eq. (33) for the three sampling methods. We can see that the error of CoRV scales almost linearly with the step size \(\epsilon _0\), as suggested by Theorem 5. The errors of mirror and Ito are significantly greater than those of CoRV. Smaller step sizes do not improve Ito, implying difficulty in practical applications.
6 Proofs
6.1 Proof of Lemma 1
Proof
From the assumption on the target pdf, and \(|\varphi | \rightarrow \infty\) as \(\theta \rightarrow \partial S\),
for constant \(C>0\). From Lemma 4,
Using
we have
\(\square\)
6.2 Solution existence and uniqueness of SDE (27)
We check the existence of the solution of the SDE (27). The following result is well-known.
Lemma 2
(Solution existence) Let \(U'(\varphi )\) be a continuous function of \(\varphi\). Then the solution of the SDE (27) exists.
We also confirm the uniqueness of the solution. We employ the weak uniqueness for the uniqueness in the sense of a distribution law.
Theorem 6
(Weak uniqueness (Stroock and Varadhan 1979)) Consider a d-dimensional SDE of \(X\in {\mathbb{R}}^d\),
$$dX(t) = a(X(t))\, dt + b(X(t))\, dW(t). \qquad \text{(38)}$$
Let a(x) be a bounded measurable function for \(x\in {\mathbb{R}}^d\). Let \(B(x) = b(x)^\intercal b(x)\) be a bounded, continuous function for which a constant \(K>0\) exists such that
$$\zeta ^\intercal B(x) \zeta \ge K \Vert \zeta \Vert ^2$$
for all \(\zeta = (\zeta _1,\cdots ,\zeta _d) \in {\mathbb{R}}^d\). Then the uniqueness in the sense of a distribution law holds for the solution of the SDE (38).
The solution is unique under the following condition of \(U'(\varphi )\).
Lemma 3
(Solution uniqueness) Let the proxy potential gradient \(U'(\varphi )\) be a bounded function. Then the solution of the SDE (27) is unique in the sense of a distribution law.
Proof
From Theorem 6, the condition on the diffusion coefficient is straightforwardly confirmed: letting \(b = \sqrt{2}\) and \(\zeta \in {\mathbb{R}}\), there exists a constant \(K>0\) such that \(\zeta B \zeta = 2\zeta ^2 \ge K\zeta ^2\).
\(\square\)
6.3 Lemma 4
The following lemma is essential for characterizing the proxy potential and the solution of the proxy SDE (27).
Lemma 4
(Limit of transform derivative) Under Assumption 2, we have
Proof
Using L'Hôpital's rule,
Thus we have
\(\square\)
6.4 Lemma 5
Lemma 5
(Proxy gradient error) Let \(\delta\) be a noise of the stochastic gradient of the target potential that satisfies Assumption 1, and let f satisfy Assumption 2. Then for any noise \(\delta _\varphi\) of the stochastic gradient of the proxy potential
we have:
for some integer \(l \ge 2\).
Proof
Since \(\widehat{U}'_\theta (\theta )\) satisfies Assumption 1
as in Eq. (24), the stochastic gradient of the proxy potential is
by letting \(\delta _\varphi = f'(\varphi ) \delta\). Since Assumption 2 implies that the derivative of the transform is always finite, \(\delta _\varphi\) also satisfies zero mean and finite variance
\(\square\)
6.5 Proof of Theorem 2
Proof
From Assumption 2 and Lemma 4, \(f'(\varphi )\) tends to zero as \(\varphi\) approaches the infinity corresponding to the boundary \(\partial S\). This implies that \(g'(\theta ) = 1/f'(g(\theta ))\) diverges as \(\theta \rightarrow \partial S\).
From Eq. (19), the singlestep difference is given by
Considering \(\eta _t\sim {\mathcal{N}}(0,1)\), the factor \(g'(\theta )\) almost surely dominates this quantity. Therefore,
\(\square\)
6.6 Proof of Theorem 3
Proof
From Eq. (47) of Lemma 5 in Sect. 6.4, we have
where \(\delta _\varphi = f'(\varphi ) \delta\) and \(\delta\) satisfies Assumption 1. From Lemma 4,
\(\square\)
6.7 Proof of Theorem 4
Proof
From Lemmas 2 and 3, there exists a unique solution in the sense of a distribution law.
From Lemmas 1, 5 and Assumption 3, the SDE (27) satisfies the same assumptions that Sato and Nakagawa (2014) used for SGLD in unconstrained state space. The transition probability density function \(p(\varphi ,t)\) follows the Fokker–Planck equation
and its stationary distribution is
Note that \(f'(\varphi (t))\) is always finite from Assumption 2. Applying Eq. (23), we obtain the stationary distribution as
\(\square\)
6.8 Proof of Theorem 5
Proof
Let us consider stochastic differential equation
and its approximation in time \(t_{k-1} \le t \le t_k\)
where \(\widetilde{a}(\varphi (t)) = a(\varphi (t)) +\delta _{\varphi ,t}\).
Using Lemma 5 and Theorem 6 of Sato and Nakagawa (2014), for the test function h, we have
From the Weierstrass theorem, there exists constant \(C_k>0\) such that
for time \(t_{k-1} \le t \le t_k\). Letting the maximum value of \(C_k\) be \(C_{\max }\) and \(\epsilon _{t_{k-1}}\) be \(\epsilon _0\),
That is, the sample of proxy variable \(\varphi\) generated by Eq. (25) weakly converges
Let test function h be a composition of the transform function f and a test function \(h_\theta\) in the target domain: \(h(\cdot ) = h_\theta (f(\cdot ))\). Thus, \(h(\varphi (T)) = h_\theta (\theta (T))\) and \(h(\widetilde{\varphi }(T)) = h_\theta (\widetilde{\theta }(T))\). The sample of the target variable \(\theta\) satisfies
\(\square\)
7 Experiments
In this section, we show the usefulness of our method using a range of models covering common application scenarios. The results demonstrate the practical efficacy of the CoRV approach on top of the theoretical justifications discussed above. We used a P100 GPU accelerator for all experiments.
7.1 Bayesian NMF
For a typical application that uses a probability distribution supported on a finite or semi-infinite interval, we considered Bayesian nonnegative matrix factorization (Cemgil 2009). Given an observed \(I\times J\) matrix X, whose components take nonnegative values, we approximated it with a low-rank matrix product WH, where W is an \(I\times R\) and H is an \(R\times J\) nonnegative matrix. The prior distribution and likelihood are
where \(\lambda _W\) and \(\lambda _H\) are hyperparameters.
SGLD uses the mirroring trick to generate samples with the stochastic gradient evaluated on a minibatch as follows:
where noise \(\eta\) conforms to \({\mathcal{N}}(0,I)\) with I being the \(R\times R\) identity matrix. \(W_{i:}^*\) denotes the sample at time \(t+1\) given that \(W_{i:}\) is the sample at time t. The absolute value is taken element-wise, which corresponds to the mirroring trick. The stochastic gradient is
where \(j_k \in \{1,\cdots ,J\}\) is the index of the kth data point in minibatch S, \(X_{k}\) is the nonnegative value of the kth data point, and \(\widehat{X}_{k} = \sum _{r=1}^R W_{i_k r} H_{r j_k}\) is its estimate.
CoRV SGLD updates proxy variables by
Here \(f'\) and \(f''\) are applied element by element. We draw the sample of \(\varphi _H\) in the same manner. Note that Eq. (70) bypasses the mirroring trick because the proxy variables \(\varphi _W\) and \(\varphi _H\) range over the entire domain \({\mathbb{R}}\). Matrices W, H are always nonnegative via the element-wise transform \(f: {\mathbb{R}} \rightarrow {\mathbb{R}}_+\). The algorithms are shown below.
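The element-wise mechanism can be sketched as follows. The softplus transform and the toy stochastic gradient below (a constant exponential-prior term) are illustrative stand-ins; the model-specific gradients are those given by the equations above and are not reproduced here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    return np.logaddexp(0.0, x)

def corv_update(phi, stoch_grad_U_theta, eps, rng):
    """One Eq. (70)-style CoRV SGLD step, applied element-wise to the proxy
    matrix phi (for W or H): theta = f(phi) with f = softplus, and f', f''
    evaluated element by element."""
    theta = softplus(phi)
    grad_phi = stoch_grad_U_theta(theta) * sigmoid(phi) - sigmoid(-phi)
    return phi - eps * grad_phi + np.sqrt(2 * eps) * rng.standard_normal(phi.shape)

# Toy usage with a hypothetical gradient (exponential prior with rate lam):
rng = np.random.default_rng(0)
I, R = 4, 2
phi_W = rng.standard_normal((I, R))
lam = 1.0
new_phi_W = corv_update(phi_W, lambda W: np.full_like(W, lam), eps=1e-3, rng=rng)
W = softplus(new_phi_W)  # nonnegative sample, no mirroring needed
```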
Configuration In the experiments, we employed the MovieLens dataset, a commonly used benchmark for matrix factorization tasks (Ahn et al. 2015). The data matrix X consists of \(I = 71{,}567\) users and \(J = 10{,}681\) items with 10,000,054 nonzero entries in total. It was split into 75 : 12.5 : 12.5 for training, validation, and testing. We compared (1) CoRV, (2) the state-of-the-art SGLD-based method (Ahn et al. 2015) with a modification for nonnegative values, and (3) SGRLD (Patterson and Teh 2013) using the natural gradient with diagonal preconditioning. Methods (2) and (3) used the mirroring trick. We evaluated each of the sampling methods through the Bayesian prediction accuracy. We also compared three transform functions for constraining to nonnegative variables: exp, softplus, and ICLL in Table 3. The Itô formulation was omitted due to significant numerical instability. The test root mean square error (RMSE) was used as the performance metric. The prediction was given by the Bayesian predictive mean computed by a moving average. We set the number of dimensions of latent variables R to 20 and 50. We trained for 10,000 iterations with \(R=20\) and for 20,000 with \(R=50\). The step size was chosen by TPE (Bergstra et al. 2011) over 100 trials to minimize the validation loss. We set hyperparameters \(\lambda _W, \lambda _H\) to 1.0 and the size of the minibatch S to 10,000.
Result Figure 3 shows the curves of root mean square error (RMSE) values as a function of iterations. SGLD and SGRLD are existing methods with the mirroring trick, whereas exp, softplus, and ICLL indicate our method with the specified transform function. We observed that CoRV SGLD made better predictions with fewer iterations than the other two algorithms. When \(R=20\) (Fig. 3 left), SGLD took 10,000 iterations to reach an RMSE of 0.90, whereas CoRV SGLD (softplus and ICLL) achieved it with only 3,000 iterations. While the choice of transform function may influence the performance, CoRV outperformed the best-performing baseline SGRLD.
Computational Complexity CoRV SGLD requires the additional computation of the transformation step compared to vanilla SGLD (see Algorithm 2). In most cases, gradient computation dominates the SGLD calculation and is proportional to the number of data points in each minibatch. The transformation step depends only on the number of parameters and does not change the complexity of the gradient computation. The influence on computation time was limited: the measured execution time increased by at most 10%.
7.2 Bayesian binary neural network
A binary neural network, whose parameters are restricted to binary, is expected to achieve high performance on small devices in terms of memory efficiency (Courbariaux et al. 2015; Hubara et al. 2016). We evaluated each of the sampling methods through the Bayesian prediction accuracy of a binary neural network model.
The weight parameter w was trained as a continuous variable and then binarized into \(\{-1, +1\}\) by thresholding at 0 before inference. A transformed beta prior was defined on the constrained state space \((-1,+1)\),
with hyperparameters \(\alpha ,\beta <1\) and the beta function B. It encourages the parameter to have high density near the boundary \(\{-1,+1\}\), so that a method that can correctly sample near the edges would be expected to keep high accuracy after binarization.
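One standard construction consistent with this description maps \(u \sim \mathrm{Beta}(\alpha ,\beta )\) through \(w = 2u - 1\); the paper's exact parameterization may differ, so the following is a hedged sketch of the resulting log-density and its gradient:

```python
import numpy as np
from math import lgamma

def log_beta_prior_pm1(w, alpha, beta):
    """Log-density on (-1, 1) obtained by mapping u ~ Beta(alpha, beta)
    through w = 2u - 1 (one standard construction; illustrative)."""
    logB = lgamma(alpha) + lgamma(beta) - lgamma(alpha + beta)
    return ((alpha - 1) * np.log((1 + w) / 2)
            + (beta - 1) * np.log((1 - w) / 2) - logB - np.log(2))

def grad_log_beta_prior_pm1(w, alpha, beta):
    """d/dw of the log-density; with alpha, beta < 1 it pushes w toward +/-1."""
    return (alpha - 1) / (1 + w) - (beta - 1) / (1 - w)

# With alpha = beta = 0.5 the density is symmetric and piles up near the
# boundary {-1, +1}, matching the motivation in the text.
```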
In the experiments, we considered a Bayesian binary three-layer feedforward network containing 50 hidden units with the ReLU activation. We employed the MNIST (Lecun et al. 1998) and Fashion-MNIST (Xiao et al. 2017) datasets. Both datasets were split into 8 : 1 : 1 for training, validation, and testing. We compared (1) CoRV using the sigmoid, arctangent, and softsign functions, and (2) standard SGLD with the mirroring trick. The Itô formulation was omitted due to significant numerical instability. We evaluated the cross-entropy loss of the softmax classifier and the classification accuracy. The accuracy was given by the Bayesian predictive mean computed by a moving average of binarized weights at each epoch. We trained the networks for 100 epochs on MNIST and 300 epochs on Fashion-MNIST. The step size was chosen by TPE (Bergstra et al. 2011) over 100 trials to minimize the validation loss.
Results Figure 4 presents the test loss and accuracy. Note that this experiment aims to compare sampling methods on the same model rather than to propose a state-of-the-art network. The learning curves show that CoRV achieves better prediction than the mirroring heuristic. In practice, it is also useful that the transformation enables stable computation with a large step size.
8 Conclusion
Since SGLD is designed for unbounded random variables, sampling bounded random variables with it has resorted to heuristics. We demonstrated, both empirically and theoretically, that such heuristics may sacrifice sampling accuracy. To handle such random variables, we generalized SGLD using the change-of-random-variable (CoRV) formulation and analyzed its weak convergence. Empirical evaluations showed that our CoRV SGLD outperformed existing heuristic alternatives on Bayesian nonnegative matrix factorization and binary neural networks.
References
Ahn, S., Korattikara, A., Liu, N., Rajan, S., & Welling, M. (2015). Large-scale distributed Bayesian matrix factorization using stochastic gradient MCMC. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15), 9–18.
Bergstra, J. S., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for hyperparameter optimization. In Advances in Neural Information Processing Systems, 24, 2546–2554.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Brosse, N., Durmus, A., Moulines, É., & Pereyra, M. (2017). Sampling from a log-concave distribution with compact support with proximal Langevin Monte Carlo. In Proceedings of the 2017 Conference on Learning Theory, 319–342.
Bubeck, S., Eldan, R., & Lehec, J. (2015). Finite-time analysis of projected Langevin Monte Carlo. In Advances in Neural Information Processing Systems, 28, 1243–1251.
Bubeck, S., Eldan, R., & Lehec, J. (2018). Sampling from a log-concave distribution with projected Langevin Monte Carlo. Discrete & Computational Geometry, 59(4), 757–783.
Cemgil, A. T. (2009). Bayesian inference for nonnegative matrix factorisation models. Computational Intelligence and Neuroscience. https://doi.org/10.1155/2009/785152.
Courbariaux, M., Bengio, Y., & David, J. P. (2015). BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, 28, 3123–3131.
Durmus, A., Moulines, É., & Pereyra, M. (2018). Efficient Bayesian computation by proximal Markov chain Monte Carlo: When Langevin meets Moreau. SIAM Journal on Imaging Sciences, 11(1), 473–506.
Gardiner, C. (2009). Stochastic methods: A handbook for the natural and social sciences (4th ed.). Springer Series in Synergetics. Berlin: Springer.
Hubara, I., Courbariaux, M., Soudry, D., ElYaniv, R., & Bengio, Y. (2016). Binarized neural networks. In Advances in Neural Information Processing Systems, 29, 4107–4115.
Iacus, S. M. (2008). Simulation and inference for stochastic differential equations. Springer Series in Statistics. New York: Springer-Verlag.
Itô, K. (1944). Stochastic integral. Proceedings of the Imperial Academy, Tokyo, 20(8), 519–524.
Lecun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
Ma, Y. A., Chen, T., & Fox, E. B. (2015). A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, 28, 2917–2925.
Patterson, S., & Teh, Y. W. (2013). Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems, 26, 3102–3110.
Pereyra, M. (2016). Proximal Markov chain Monte Carlo algorithms. Statistics and Computing, 26(4), 745–760.
Sato, I., & Nakagawa, H. (2014). Approximation analysis of stochastic gradient Langevin dynamics by using Fokker–Planck equation and Ito process. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 982–990.
Stroock, D., & Varadhan, S. (1979). Multidimensional diffusion processes. Berlin: Springer.
Teh, Y. W., Thiery, A. H., & Vollmer, S. J. (2016). Consistency and fluctuations for stochastic gradient Langevin dynamics. Journal of Machine Learning Research, 17, 193–225.
Welling, M., & Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 681–688.
Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747.
Editors: Ira Assent, Carlotta Domeniconi, Aristides Gionis, Eyke Hüllermeier.
Yokoi, S., Otsuka, T., & Sato, I. Weak approximation of transformed stochastic gradient MCMC. Machine Learning, 109, 1903–1923 (2020). https://doi.org/10.1007/s10994-020-05904-5