Appendix 1: Convergence to the correct target distribution
In order to prove that the equilibrium distribution remains the same, it suffices to show that the detailed balance condition still holds. Denote \(\theta = (p,\,q),\; \theta ^{\prime }={\tilde{\phi }}_t(\theta ).\) In the Metropolis–Hastings step, we use the original Hamiltonian to compute the acceptance probability
$$\begin{aligned} \alpha (\theta ,\,\theta ^{\prime }) = \min (1,\,\exp [-H(\theta ^{\prime }) + H(\theta )]), \end{aligned}$$
therefore,
$$\begin{aligned} \alpha (\theta ,\,\theta ^{\prime }){\mathbb {P}}(\mathrm{d}\theta )&= \alpha (\theta ,\,\theta ^{\prime }) \exp [-H(\theta )]\mathrm{d}\theta \\&\mathop {=}\limits ^{\theta = {\tilde{\phi }}^{-1}_{t}(\theta ^{\prime })} \min (\exp [-H(\theta )],\,\exp [-H(\theta ^{\prime })])\left| \frac{\mathrm{d}\theta }{\mathrm{d}\theta ^{\prime }}\right| \mathrm{d}\theta ^{\prime } \\&= \alpha (\theta ^{\prime },\,\theta )\exp [-H(\theta ^{\prime })]\mathrm{d}\theta ^{\prime }\\&= \alpha (\theta ^{\prime },\,\theta ) {\mathbb {P}}(\mathrm{d}\theta ^{\prime }), \end{aligned}$$
since \( \left| \frac{\mathrm{d}\theta }{\mathrm{d}\theta ^{\prime }}\right| = 1\) due to the volume conservation property of the surrogate-induced Hamiltonian flow \({\tilde{\phi }}_t.\) Having shown that the detailed balance condition is satisfied, and given the reversibility of the surrogate-induced Hamiltonian flow, we conclude that the modified Markov chain converges to the correct target distribution.
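As a minimal numerical illustration of this accept–reject step (a hypothetical 1-D sketch, not the paper's full algorithm: standard normal target with \(U(q)=q^2/2\), unit mass matrix, and the surrogate-induced proposal left abstract), the acceptance probability is always computed from the original Hamiltonian:

```python
import numpy as np

rng = np.random.default_rng(0)

def hamiltonian(q, p):
    # Hypothetical 1-D example: potential U(q) = q^2/2 (standard normal
    # target) plus kinetic energy p^2/2.
    return 0.5 * q**2 + 0.5 * p**2

def mh_accept(theta, theta_new):
    """Accept/reject using the ORIGINAL Hamiltonian, even though the
    proposal theta_new would come from the surrogate-induced flow."""
    (q, p), (q_new, p_new) = theta, theta_new
    alpha = min(1.0, np.exp(hamiltonian(q, p) - hamiltonian(q_new, p_new)))
    return rng.random() < alpha, alpha

# A proposal that lowers the total energy is always accepted.
accepted, alpha = mh_accept((2.0, 0.0), (1.0, 0.0))
```

Because the surrogate enters only through the proposal, any error in the surrogate affects efficiency but not the stationary distribution.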
Appendix 2: Potential matching
In the paper, training data collected from the history of the Markov chain are used without detailed explanation. Here, we analyze the asymptotic behavior of the surrogate-induced distribution and explain why the history of the Markov chain is a reasonable choice of training data. Recall that we find our neural network surrogate function by minimizing the mean square error (11). Similarly to Hyvärinen (2005), it turns out that minimizing (11) is asymptotically equivalent to minimizing a new distance between the surrogate-induced distribution and the underlying target distribution, independent of their corresponding normalizing constants.
Suppose we know the density of the underlying intractable target distribution up to a constant
$$\begin{aligned} P(q|Y) = \frac{1}{Z}\exp (-U(q)), \end{aligned}$$
where Z is the unknown normalizing constant. Our goal is to approximate this distribution using a parametrized density model, also known up to a constant,
$$\begin{aligned} Q(q,\,\theta ) = \frac{1}{Z(\theta )}\exp (-V(q,\,\theta )). \end{aligned}$$
Ignoring the multiplicative constants, the corresponding potential energy functions are U(q) and \(V(q,\,\theta ),\) respectively. The straightforward squared distance between the two potentials is not a well-defined measure of discrepancy between the two distributions because of the unknown normalizing constants. Therefore, we use the following distance instead:
$$\begin{aligned} K(\theta )&= \min _d\int \Vert V(q,\,\theta )-U(q)-d\Vert ^2P(q|Y)\mathrm{d}q\nonumber \\&=\int \Vert V(q,\,\theta )-U(q)\Vert ^2 P(q|Y)\mathrm{d}q \nonumber \\&\quad - \left[ E_q(V(\theta )-U)\right] ^2 = {\mathbb {V}}\mathrm {ar}_{q}(V(\theta )-U). \end{aligned}$$
(14)
Because of its similarity to score matching (Hyvärinen 2005), we refer to the approximation method based on this new distance as potential matching; the corresponding potential matching estimator of \(\theta \) is given by
$$\begin{aligned} {\hat{\theta }} = \arg \min _{\theta } K(\theta ). \end{aligned}$$
It is easy to verify that \(K(\theta ) = 0 \Rightarrow V(\theta ) = U + \mathrm{constant} \Rightarrow Q(q,\,\theta ) = P(q|Y),\) so \(K(\theta )\) is a well-defined squared distance. Exact evaluation of (14) is analytically intractable. In practice, given N samples from the target distribution \(q_1,\,q_2,\ldots ,q_N,\) we minimize the empirical version of (14)
$$\begin{aligned} {\tilde{K}}(\theta )&= \min _d\frac{1}{N}\sum _{n=1}^N \left\| V\left( q_n,\,\theta \right) - U\left( q_n\right) -d\right\| ^2 \nonumber \\&= \frac{1}{N}\sum _{n=1}^N\left\| V\left( q_n,\,\theta \right) -U\left( q_n\right) \right\| ^2 \nonumber \\&\quad - \left( \frac{1}{N}\sum _{n=1}^N\left[ V\left( q_n,\,\theta \right) -U\left( q_n\right) \right] \right) ^2, \end{aligned}$$
(15)
which converges to K by the law of large numbers. Equation (15) becomes more concise if we allow a shift term in the parametrized model (\(V(q,\,\theta ) = V(q,\,\theta _{-d}) + \theta _d\)). In that case, the empirical potential matching estimator is
$$\begin{aligned} {\tilde{\theta }}&= \arg \min _\theta {\tilde{K}}(\theta ) = \arg \min _\theta \min _d\frac{1}{N}\sum _{n=1}^N \left\| V\left( q_n,\,\theta _{-d}\right) \right. \\&\left. \quad +\left( \theta _d-d\right) - U\left( q_n\right) \right\| ^2\\&= \arg \min _\theta \frac{1}{N}\sum _{n=1}^N \left\| V\left( q_n,\,\theta _{-d}\right) + \theta _d - U\left( q_n\right) \right\| ^2\\&= \arg \min _\theta \frac{1}{N}\sum _{n=1}^N \left\| V\left( q_n,\,\theta \right) - U\left( q_n\right) \right\| ^2. \end{aligned}$$
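A quick numerical check of the identity behind (15) (a hypothetical numpy sketch; the samples, U, and V below are stand-ins, not the paper's model): minimizing over the shift d leaves exactly the sample variance of \(V - U\), which vanishes when the surrogate matches the potential up to a constant.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in samples and potentials (hypothetical 1-D example).
q = rng.normal(size=200)
U = 0.5 * q**2                # true potential, known up to a constant
V = 0.5 * q**2 + 3.7          # surrogate differing from U by a constant

# Empirical potential matching distance (15): minimizing over the shift
# d leaves the sample variance of V - U.
diff = V - U
K_tilde = np.mean(diff**2) - np.mean(diff)**2

# A surrogate equal to U up to a constant has (numerically) zero distance.
```

This is why the unknown normalizing constants are harmless: any constant offset between the two potentials is absorbed by d.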
In particular, if we adopt an additive model for \(V(q,\,\theta )\) with a shift term
$$\begin{aligned} V(q,\,\theta ) = \sum _{i=1}^sv_i\sigma \left( \varvec{w_i}\cdot q+d_i\right) + b,\quad \theta = (\varvec{v},\,b), \end{aligned}$$
where \(\varvec{w_i},\,d_i\), and the activation function \(\sigma \) are chosen according to Algorithm 2, and we collect early evaluations from the history of the Markov chain
$$\begin{aligned} {\mathcal {T}}_N = \left\{ \left( q^{(1)},\,U\left( q^{(1)}\right) \right) ,\,\left( q^{(2)},\,U\left( q^{(2)}\right) \right) ,\ldots ,\left( q^{(N)},\,U\left( q^{(N)}\right) \right) \right\} , \end{aligned}$$
as training data; this way, the estimated parameters (i.e., weights for the output neuron) are asymptotically the potential matching estimates
$$\begin{aligned} \lim _{N\rightarrow \infty }{\hat{\theta }}_{\mathrm{ELM},{\mathcal {T}}_N} = \arg \min _{\theta }\lim _{N\rightarrow \infty }C\left( \theta |{\mathcal {T}}_N\right) = {\hat{\theta }}, \end{aligned}$$
since the Markov chain will eventually converge to the target distribution. When truncated at a finite N, the estimated parameters are approximately the empirical potential matching estimates, except that the samples from the history of the Markov chain are drawn only approximately (though quite closely) from the target distribution.
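This training step can be sketched as follows (hypothetical sizes and a tanh activation; the random hidden-layer weights stand in for the \(\varvec{w_i},\,d_i\) of Algorithm 2, and the chain history is faked with i.i.d. draws rather than actual MCMC output):

```python
import numpy as np

rng = np.random.default_rng(2)

s = 50                                  # number of hidden neurons
q_hist = rng.normal(size=(500, 1))      # stand-in for chain history q^{(n)}
U_hist = 0.5 * q_hist[:, 0]**2          # early evaluations U(q^{(n)})

# Random hidden-layer weights: only the output weights are estimated.
W = rng.normal(size=(1, s))
d = rng.normal(size=s)
H = np.tanh(q_hist @ W + d)             # hidden-layer output matrix
H1 = np.hstack([H, np.ones((len(H), 1))])  # extra column: shift term b

theta = np.linalg.pinv(H1) @ U_hist     # least-squares output weights
mse = np.mean((H1 @ theta - U_hist)**2)
```

Because the model includes the shift column, the least-squares fit here coincides with the empirical potential matching objective derived above.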
Appendix 3: Adaptive learning
Theorem 1
(Greville) If a matrix A with k columns is denoted by \(A_k\) and partitioned as \(A_k=[A_{k-1}\;\,a_k],\) with \(A_{k-1}\) a matrix having \(k-1\) columns, then the Moore–Penrose generalized inverse of \(A_k\) is
$$\begin{aligned} A_k^{\dagger } =\begin{bmatrix}A_{k-1}^{\dagger }(I-a_kb_k^\mathrm{T})\\b_k^\mathrm{T}\end{bmatrix}, \end{aligned}$$
where
$$\begin{aligned} c_k = \left( I-A_{k-1}A_{k-1}^{\dagger }\right) a_k, \quad b_k = \left\{ \begin{array}{ll} \frac{(A_{k-1}^{\dagger })^\mathrm{T}A_{k-1}^{\dagger }a_k}{1+ \Vert A_{k-1}^\dagger a_k\Vert ^2}, &{}c_k=0,\\ \frac{c_k}{\Vert c_k\Vert ^2}, &{} c_k\ne 0. \end{array}\right. \end{aligned}$$
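Theorem 1 can be verified numerically by building a pseudoinverse one column at a time (a sketch with numpy; `greville_step` is our helper name, not from the paper, and a random full-column-rank matrix exercises the \(c_k \ne 0\) branch):

```python
import numpy as np

rng = np.random.default_rng(3)

def greville_step(A_pinv, A, a):
    """Given A_{k-1}^+ (A_pinv), A_{k-1} (A), and the new column a_k (a),
    return A_k^+ for A_k = [A_{k-1}  a_k] via Greville's formulas."""
    c = a - A @ (A_pinv @ a)                   # c_k = (I - A A^+) a_k
    if np.linalg.norm(c) > 1e-10:              # c_k != 0 branch
        b = c / (c @ c)
    else:                                      # c_k = 0 branch
        Aa = A_pinv @ a
        b = (A_pinv.T @ Aa) / (1.0 + Aa @ Aa)
    top = A_pinv - np.outer(A_pinv @ a, b)     # A_{k-1}^+ (I - a_k b_k^T)
    return np.vstack([top, b[None, :]])

# Grow pinv(A) column by column and compare with the direct answer.
A = rng.normal(size=(6, 4))
P = np.linalg.pinv(A[:, :1])
for k in range(1, 4):
    P = greville_step(P, A[:, :k], A[:, k])
```

In exact arithmetic the recursion reproduces \(A_k^{\dagger }\) exactly; in floating point it agrees with the direct pseudoinverse to rounding error.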
Proof of Proposition 1
To save computational cost, we only update the estimator. Suppose the current output matrix is \(H_{k+1}=\begin{bmatrix}H_k\\h_{k+1}^\mathrm{T}\end{bmatrix}\) and the target vector is \(T_{k+1}=\begin{bmatrix}T_k\\t_{k+1}\end{bmatrix},\) then
$$\begin{aligned} v_{k+1}^\mathrm{T}H_{k+1}^\mathrm{T}&= T_{k+1}^\mathrm{T} \Rightarrow v_{k+1}^\mathrm{T} \\&= T_{k+1}^\mathrm{T}\left( H_{k+1}^\mathrm{T}\right) ^{\dagger } \\&= \left[ T_k^\mathrm{T} t_{k+1}\right] \left( \left[ H_k^\mathrm{T} h_{k+1}\right] \right) ^{\dagger }\\&= \left[ T_k^\mathrm{T} t_{k+1}\right] \begin{bmatrix}(H_k^\mathrm{T})^{\dagger }(I-h_{k+1}b_{k+1}^\mathrm{T})\\b_{k+1}^\mathrm{T}\end{bmatrix}\\&= T_k^\mathrm{T}\left( H_k^\mathrm{T}\right) ^{\dagger }\left( I-h_{k+1}b_{k+1}^\mathrm{T}\right) + t_{k+1}b_{k+1}^\mathrm{T}\\&= v_k^\mathrm{T} + \left( t_{k+1}-v_k^\mathrm{T}h_{k+1}\right) b_{k+1}^\mathrm{T}, \end{aligned}$$
where \(b_{k+1}\) is obtained according to Theorem 1. Note that the computation of \(b_{k+1}\) and \(c_{k+1}\) still involves \(H_k\) and \(H_k^{\dagger }\) whose sizes increase as data size grows. Following Kohonen (1988), Kovanic (1979), and Schaik and Tapson (2015), we introduce two auxiliary matrices here
$$\begin{aligned}&\varPhi _{k} = I-H_k^\mathrm{T}\left( H_k^\mathrm{T}\right) ^{\dagger } \in {\mathbb {R}}^{s\times s},\quad \\&\varTheta _{k}= \left( \left( H_k^\mathrm{T}\right) ^{\dagger }\right) ^\mathrm{T}\left( H_k^\mathrm{T}\right) ^{\dagger }= H_k^{\dagger }\left( H_k^\mathrm{T}\right) ^{\dagger } \in {\mathbb {R}}^{s\times s}, \end{aligned}$$
and rewrite \(b_{k+1}\) and \(c_{k+1}\) as
$$\begin{aligned} c_{k+1} = \varPhi _{k}h_{k+1}, \quad b_{k+1} = \left\{ \begin{array}{ll} \frac{\varTheta _kh_{k+1}}{1+ h_{k+1}^\mathrm{T}\varTheta _kh_{k+1}}, &{}c_{k+1}=0,\\ \frac{c_{k+1}}{\Vert c_{k+1}\Vert ^2}, &{} c_{k+1}\ne 0. \end{array}\right. \end{aligned}$$
Updating of the two auxiliary matrices can also be done adaptively
$$\begin{aligned} \varPhi _{k+1}= & {} I - H_{k+1}^\mathrm{T}\left( H_{k+1}^\mathrm{T}\right) ^{\dagger } \\= & {} I-\left[ H_k^\mathrm{T}\;\,h_{k+1}\right] \begin{bmatrix}(H_k^\mathrm{T})^{\dagger }(I-h_{k+1}b_{k+1}^\mathrm{T})\\b_{k+1}^\mathrm{T}\end{bmatrix}\\= & {} \varPhi _k -\varPhi _kh_{k+1}b_{k+1}^\mathrm{T},\\ \varTheta _{k+1}= & {} H_{k+1}^{\dagger }\left( H_{k+1}^\mathrm{T}\right) ^{\dagger } \\= & {} \left[ \left( I-b_{k+1}h_{k+1}^\mathrm{T}\right) H_k^{\dagger }\;\,b_{k+1}\right] \begin{bmatrix}(H_k^\mathrm{T})^{\dagger }(I-h_{k+1}b_{k+1}^\mathrm{T})\\b_{k+1}^\mathrm{T}\end{bmatrix}\\= & {} \left( I-b_{k+1}h_{k+1}^\mathrm{T}\right) \varTheta _k\left( I-h_{k+1}b_{k+1}^\mathrm{T}\right) + b_{k+1}b_{k+1}^\mathrm{T}, \end{aligned}$$
and if \(c_{k+1} = 0,\) these formulas can be simplified as below
$$\begin{aligned} \varPhi _{k+1} = \varPhi _k,\quad \varTheta _{k+1} = \varTheta _k - \varTheta _kh_{k+1}b_{k+1}^\mathrm{T}. \end{aligned}$$
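The full adaptive scheme can be sanity-checked numerically (a hypothetical numpy sketch with made-up sizes; `adaptive_step` is our name). Starting with fewer samples than hidden neurons exercises the \(c_{k+1}\ne 0\) branch; once \(H_k\) reaches full column rank, later samples fall into the simplified \(c_{k+1}=0\) branch:

```python
import numpy as np

rng = np.random.default_rng(4)

s = 5
H = rng.normal(size=(3, s))      # start with fewer rows than neurons
T = rng.normal(size=3)

v = np.linalg.pinv(H) @ T
Phi = np.eye(s) - H.T @ np.linalg.pinv(H.T)
Theta = np.linalg.pinv(H) @ np.linalg.pinv(H.T)

def adaptive_step(v, Phi, Theta, h, t):
    """Rank-one updates of v, Phi, Theta for a new sample (h, t)."""
    c = Phi @ h
    if np.linalg.norm(c) > 1e-10:                  # c_{k+1} != 0
        b = c / (c @ c)
        Phi = Phi - np.outer(c, b)                 # Phi - Phi h b^T
        M = np.eye(s) - np.outer(b, h)             # I - b h^T
        Theta = M @ Theta @ M.T + np.outer(b, b)
    else:                                          # c_{k+1} = 0
        Th = Theta @ h
        b = Th / (1.0 + h @ Th)
        Theta = Theta - np.outer(Th, b)            # Theta - Theta h b^T
    v = v + (t - v @ h) * b                        # v_k + (t - v^T h) b
    return v, Phi, Theta

for _ in range(20):
    h_new, t_new = rng.normal(size=s), rng.normal()
    v, Phi, Theta = adaptive_step(v, Phi, Theta, h_new, t_new)
    H, T = np.vstack([H, h_new]), np.append(T, t_new)
```

Each step costs \(O(s^2)\) regardless of the data size k, which is the point of carrying \(\varPhi \) and \(\varTheta \) instead of \(H_k\) and \(H_k^{\dagger }\).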
\(\square \)