Appendix A: De-biased Group LASSO Estimator
In this appendix, we derive a de-biased group LASSO estimator. Our construction is essentially the same as the one presented in van de Geer (2016).
With \(\mathcal {V}_{j}\) as defined in Eq. 11, let \(\mathcal {V}_{-j}^{g} = \left (\mathcal {V}^{g}_{1},\ldots ,\mathcal {V}^{g}_{j-1}, \mathcal {V}^{g}_{j+1},\ldots , \mathcal {V}^{g}_{p}\right )\) be an n × (p − 1)d dimensional matrix. For d-dimensional vectors \(\alpha_{j,k}\), let \(\boldsymbol {\alpha }_{j} = \left (\alpha _{j,1}^{\top }, \ldots , \alpha _{j,p}^{\top }\right )^{\top }\), let \(\mathcal {P}_{j}\left (\boldsymbol {\alpha }_{j} \right ) = {\sum }_{k \neq j} \left \| \alpha _{j,k} \right \|_{2}\), and let \(\nabla \mathcal {P}_{j}\) denote the sub-gradient of \(\mathcal {P}_{j}\). We can express the sub-gradient as \(\nabla \mathcal {P}_{j}(\boldsymbol {\alpha }_{j}) = \left ((\nabla \|\alpha _{j,1}\|_{2})^{\top }, \ldots , (\nabla \|\alpha _{j,p}\|_{2})^{\top } \right )^{\top }\), where \(\nabla \|\alpha _{j,k}\|_{2} = \alpha _{j,k}/\|\alpha _{j,k}\|_{2}\) if \(\|\alpha _{j,k}\|_{2} \neq 0\), and \(\nabla \|\alpha _{j,k}\|_{2}\) is otherwise a vector with \(\ell_{2}\) norm at most one. The KKT conditions for the group LASSO imply that the estimate \(\tilde {\boldsymbol {\alpha }}^{g}_{j}\) satisfies
$$ \left( n^{g}\right)^{-1}\left( \mathcal{V}_{-j}^{g}\right)^{\top} \left( \mathbf{X}_{j}^{g} - \mathcal{V}_{-j}^{g} \tilde{\boldsymbol{\alpha}}^{g}_{j} \right) = \lambda \nabla \mathcal{P}_{j}\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} \right). $$
With some algebra, we can rewrite this as
$$ \left( n^{g}\right)^{-1}\left( \mathcal{V}_{-j}^{g}\right)^{\top} \mathcal{V}_{-j}^{g} \left( \tilde{\boldsymbol{\alpha}}_{j}^{g}- \boldsymbol{\alpha}^{g,*}_{j}\right) = -\lambda \nabla \mathcal{P}_{j}\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} \right) + \left( n^{g}\right)^{-1}\left( \mathcal{V}^{g}_{-j}\right)^{\top} \left( \mathbf{X}_{j}^{g} - \mathcal{V}_{-j}^{g} \boldsymbol{\alpha}^{g,*}_{j} \right). $$
Let Σj be defined as the population matrix \({{\varSigma}}_{j} = E\left\{ \left( n^{g}\right)^{-1}\left( \mathcal{V}_{-j}^{g}\right)^{\top}\mathcal{V}_{-j}^{g} \right\}\), and let \(\tilde {M}_{j}\) be an estimate of \({{\varSigma }}_{j}^{-1}\). We can write \(\left (\tilde {\boldsymbol {\alpha }}^{g}_{j} - \boldsymbol {\alpha }_{j}^{g,*}\right )\) as
$$ \begin{array}{@{}rcl@{}} \left( \tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}_{j}^{g,*} \right) &= &\underset{\mathrm{(i)}}{\underbrace{-\lambda \tilde{M}_{j}\nabla \mathcal{P}_{j}\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} \right)}} + \underset{\mathrm{(ii)}}{\underbrace{\left( n^{g}\right)^{-1}\tilde{M}_{j}\left( \mathcal{V}^{g}_{-j}\right)^{\top} \left( \mathbf{X}_{j}^{g} - \mathcal{V}_{-j}^{g} \boldsymbol{\alpha}^{g,*}_{j} \right)}} + \\ &&\underset{\mathrm{(iii)}}{\underbrace{\left\{I - \left( n^{g}\right)^{-1}\tilde{M}_{j}\left( \mathcal{V}_{-j}^{g}\right)^{\top} \mathcal{V}_{-j}^{g} \right\} \left( \tilde{\boldsymbol{\alpha}}_{j}^{g} - \boldsymbol{\alpha}^{g,*}_{j}\right)}}. \end{array} $$
(A.1)
The first term (i) in Eq. A.1 approximates the bias of the group LASSO estimate. Because it is a function only of the observed data and not of any unknown quantities, it can be directly subtracted from the initial estimate \(\tilde {\boldsymbol {\alpha }}_{j}^{g}\). If \(\tilde {M}_{j}\) is a consistent estimate of \({{\varSigma }}_{j}^{-1}\), the second term (ii) is asymptotically equivalent to
$$ \left( n^{g}\right)^{-1}{{\varSigma}}^{-1}_{j} \left( \mathcal{V}^{g}_{-j}\right)^{\top} \left( \mathbf{X}_{j}^{g} - \mathcal{V}_{-j}^{g} \boldsymbol{\alpha}^{g,*}_{j} \right). $$
Thus, (ii) is asymptotically equivalent to a sample average of mean-zero i.i.d. random variables, and the central limit theorem yields convergence in distribution of any low-dimensional sub-vector to a multivariate normal distribution at the \(n^{1/2}\) rate. The third term (iii) is asymptotically negligible if \(\tilde {M}_{j}\) is an approximate inverse of \((n^{g})^{-1}\left (\mathcal {V}_{-j}^{g}\right )^{\top }\mathcal {V}^{g}_{-j}\) and \(\tilde{\boldsymbol{\alpha}}_{j}^{g}\) is consistent. This suggests that an estimator of the form
$$ \check{\boldsymbol{\alpha}}_{j}^{g} = \tilde{\boldsymbol{\alpha}}_{j}^{g} + \lambda \tilde{M}_{j} \nabla \mathcal{P}_{j}\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} \right) $$
will be asymptotically normal for an appropriate choice of \(\tilde {M}_{j}\).
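To make the correction concrete, the following Python sketch computes \(\check{\boldsymbol{\alpha}}_{j}^{g}\) from hypothetical inputs: a fitted group LASSO vector `alpha_tilde`, a surrogate inverse `M_tilde`, the tuning parameter `lam`, and the block size `d` (all names are ours, not from the original implementation). For blocks that are exactly zero, we take the zero vector as the sub-gradient element, which is a valid member of the sub-differential.

```python
import numpy as np

def group_subgradient(alpha, d):
    """Sub-gradient of the group LASSO penalty: each d-dimensional block is
    mapped to block / ||block||_2; blocks at zero are mapped to zero, a
    valid element of the sub-differential."""
    grad = np.zeros_like(alpha)
    for k in range(len(alpha) // d):
        block = alpha[k * d:(k + 1) * d]
        norm = np.linalg.norm(block)
        if norm > 0:
            grad[k * d:(k + 1) * d] = block / norm
    return grad

def debias_group_lasso(alpha_tilde, M_tilde, lam, d):
    """De-biased estimate: alpha_tilde + lam * M_tilde @ grad P_j(alpha_tilde)."""
    return alpha_tilde + lam * M_tilde @ group_subgradient(alpha_tilde, d)
```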
Before describing our construction of \(\tilde {M}_{j}\), we find it helpful to consider an alternative expression for \({{\varSigma }}^{-1}_{j}\). For each k ≠ j, we define the d × d matrices \({{\varGamma }}^{*}_{j,k,l}\), l ≠ j,k, as the population coefficient matrices obtained by regressing \(\mathcal {V}^{g}_{k}\) on the remaining blocks \(\{\mathcal {V}^{g}_{l}\}_{l \neq j,k}\). We also define the d × d matrix \({C}^{*}_{j,k}\) as
$$ C^{*}_{j,k} = E\left\{ \left( n^{g}\right)^{-1} \left( \mathcal{V}^{g}_{k} - {\sum}_{l \neq k,j} \mathcal{V}^{g}_{l} {{\varGamma}}^{*}_{j,k,l} \right)^{\top} \mathcal{V}_{k}^{g} \right\}. $$
It can be shown that \({{\varSigma }}^{-1}_{j}\) can be expressed as
$$ {{\varSigma}}^{-1}_{j} = \begin{pmatrix} \left( C_{j,1}^{*}\right)^{-1} & {\cdots} & \mathbf{0} \\ {\vdots} & {\ddots} & \vdots \\ \mathbf{0} & {\cdots} & \left( C_{j,p}^{*}\right)^{-1} \end{pmatrix} \begin{pmatrix} I & -{{\varGamma}}^{*}_{j,1,2} & {\cdots} & -{{\varGamma}}^{*}_{j,1,p} \\ -{{\varGamma}}^{*}_{j,2,1} & I & {\cdots} & -{{\varGamma}}^{*}_{j,2,p} \\ {\vdots} & {\vdots} & {\ddots} & \vdots \\ -{{\varGamma}}^{*}_{j,p,1} & -{{\varGamma}}^{*}_{j,p,2} & {\cdots} & I \end{pmatrix} . $$
We can thus estimate \({{\varSigma }}_{j}^{-1}\) by performing a series of regressions to estimate each matrix \({{\varGamma }}^{*}_{j,k,l}\).
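The block-inverse identity above can be checked numerically. The sketch below is our own illustration: it uses unpenalized least squares in place of the group LASSO nodewise regressions, which makes the identity exact when n > pd, and compares the assembled matrix with a direct inverse.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, d = 200, 4, 2
V = rng.normal(size=(n, p * d))
Sigma = V.T @ V / n

M = np.zeros((p * d, p * d))
for k in range(p):
    blk = np.arange(k * d, (k + 1) * d)
    rest = np.delete(np.arange(p * d), blk)
    # unpenalized nodewise regression of block k on the remaining columns
    G, *_ = np.linalg.lstsq(V[:, rest], V[:, blk], rcond=None)
    resid = V[:, blk] - V[:, rest] @ G
    C_k = resid.T @ V[:, blk] / n
    row = np.zeros((d, p * d))
    row[:, blk] = np.eye(d)
    row[:, rest] = -G.T          # transposed regression coefficients
    M[blk, :] = np.linalg.solve(C_k, row)

assert np.allclose(M @ Sigma, np.eye(p * d))   # M recovers the inverse
```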
Following the approach of van de Geer et al. (2014), we use a group LASSO variant of the nodewise LASSO to construct \(\tilde {M}_{j}\). To proceed, we require some additional notation. For any d × d matrix Γ = (γ1,…,γd) with d-dimensional columns γc, let \(\|{{\varGamma }} \|_{2,*} = {\sum }_{c = 1}^{d} \|\gamma _{c}\|_{2}\), and let \(\nabla \| {{\varGamma }} \|_{2,*} = \left (\gamma _{1}/\|\gamma _{1}\|_{2},\ldots ,\gamma _{d}/ \|\gamma _{d}\|_{2} \right )\) denote the sub-gradient of ∥Γ∥2,∗, defined column-wise as for \(\mathcal{P}_{j}\). We use the group LASSO to obtain estimates \(\tilde {{{\varGamma }}}_{j,k,l}\) of \({{\varGamma }}^{*}_{j,k,l}\):
$$ \left\{ \tilde{{{\varGamma}}}_{j,k,l} \right\}_{l \neq j,k} = \underset{\left\{{{\varGamma}}_{j,k,l}\right\}}{\arg\min}\ \left( 2n^{g}\right)^{-1} \left\| \mathcal{V}^{g}_{k} - {\sum}_{l \neq k,j} \mathcal{V}^{g}_{l} {{\varGamma}}_{j,k,l} \right\|_{F}^{2} + \omega {\sum}_{l \neq k,j} \left\| {{\varGamma}}_{j,k,l} \right\|_{2,*},
$$
(A.3)
where ω > 0 is a tuning parameter and \(\|\cdot\|_{F}\) is the Frobenius norm.
We then estimate \(C^{*}_{j,k}\) as
$$ \tilde{C}_{j,k} = \left( n^{g}\right)^{-1} \left( \mathcal{V}^{g}_{k} - {\sum}_{l \neq k,j} \mathcal{V}^{g}_{l} \tilde{{{\varGamma}}}_{j,k,l} \right)^{\top}\left( \mathcal{V}_{k}^{g}\right). $$
Our estimate \(\tilde {M}_{j}\) takes the form
$$ \tilde{M}_{j} = \begin{pmatrix} \tilde{C}^{-1}_{j,1} & {\cdots} & \mathbf{0} \\ {\vdots} & {\ddots} & \vdots \\ \mathbf{0} & {\cdots} & \tilde{C}^{-1}_{j,p} \end{pmatrix} \begin{pmatrix} I & -\tilde{{{\varGamma}}}_{j,1,2} & {\cdots} & -\tilde{{{\varGamma}}}_{j,1,p} \\ -\tilde{{{\varGamma}}}_{j,2,1} & I & {\cdots} & -\tilde{{{\varGamma}}}_{j,2,p} \\ {\vdots} & {\vdots} & {\ddots} & \vdots \\ -\tilde{{{\varGamma}}}_{j,p,1} & -\tilde{{{\varGamma}}}_{j,p,2} & {\cdots} & I \end{pmatrix} . $$
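A minimal sketch of this assembly step, assuming the nodewise estimates \(\tilde{C}_{j,k}\) and \(\tilde{{{\varGamma}}}_{j,k,l}\) have already been computed and stored (the data layout below is our own):

```python
import numpy as np

def assemble_M_a(C_blocks, Gamma):
    """Assemble M_tilde from nodewise group LASSO estimates.

    C_blocks: list of p (d x d) matrices C_{j,k}; Gamma: dict mapping
    (k, l), k != l, to the d x d matrix Gamma_{j,k,l}."""
    p = len(C_blocks)
    d = C_blocks[0].shape[0]
    right = np.eye(p * d)
    for k in range(p):
        for l in range(p):
            if l != k:
                right[k*d:(k+1)*d, l*d:(l+1)*d] = -Gamma[(k, l)]
    left = np.zeros((p * d, p * d))
    for k in range(p):
        left[k*d:(k+1)*d, k*d:(k+1)*d] = np.linalg.inv(C_blocks[k])
    return left @ right
```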
With this construction of \(\tilde {M}_{j}\), we can establish a bound on the remainder term (iii) in Eq. A.1. To show this, we make use of the following lemma, which states a special case of the dual norm inequality for the group LASSO norm \(\mathcal {P}_{j}\) (see, e.g., Chapter 6 of van de Geer (2016)).
Lemma 1.
Let a1,…,ap and b1,…,bp be d-dimensional vectors, and let \(\mathbf {a} = \left (a_{1}^{\top },\ldots ,a_{p}^{\top }\right )^{\top }\) and \(\mathbf {b} = \left (b_{1}^{\top },\dots ,b_{p}^{\top }\right )^{\top }\) be pd-dimensional vectors. Then
$$ \langle \mathbf{a}, \mathbf{b}\rangle \leq \left( {\sum}_{j=1}^{p} \|a_{j}\|_{2} \right) \max_{j} \left\| b_{j} \right\|_{2}. $$
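The inequality is easy to verify numerically; the following snippet (our own sanity check) evaluates both sides on random block vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
p, d = 5, 3
a = rng.normal(size=(p, d))
b = rng.normal(size=(p, d))
lhs = np.sum(a * b)  # <a, b> with the vectors stacked block-wise
rhs = np.linalg.norm(a, axis=1).sum() * np.linalg.norm(b, axis=1).max()
assert lhs <= rhs + 1e-12
```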
The KKT conditions for Eq. A.3 imply that for all l ≠ j,k,
$$ \left( n^{g}\right)^{-1}\left( \mathcal{V}^{g}_{l}\right)^{\top}\left( \mathcal{V}^{g}_{k} - {\sum}_{r \neq k,j} \mathcal{V}^{g}_{r} \tilde{{{\varGamma}}}_{j,k,r}\right) = \omega \nabla \left\| \tilde{{{\varGamma}}}_{j,k,l} \right\|_{2,*}. $$
(A.4)
Lemma 1 and Eq. A.4 imply that
$$ \left\| \begin{pmatrix} \tilde{C}_{j,1} & {\cdots} & \mathbf{0} \\ {\vdots} & {\ddots} & \vdots \\ \mathbf{0} & {\cdots} & \tilde{C}_{j,p} \end{pmatrix} \left\{I - \left( n^{g}\right)^{-1}\tilde{M}_{j}\left( \mathcal{V}_{-j}^{g}\right)^{\top} \mathcal{V}_{-j}^{g} \right\} \left( \tilde{\boldsymbol{\alpha}}_{j}^{g} - \boldsymbol{\alpha}^{g,*}_{j}\right) \right\|_{\infty} \leq \omega \mathcal{P}_{j}\left( \tilde{\boldsymbol{\alpha}}_{j}^{g} - \boldsymbol{\alpha}^{g,*}_{j}\right), $$
where \(\|\cdot \|_{\infty }\) is the \(\ell _{\infty }\) norm. With \(\omega \asymp \left \{\log (p)/n\right \}^{1/2}\), \(\tilde {M}_{j}\) can be shown to be consistent under sparsity of the \({{\varGamma }}^{*}_{j,k,l}\) (i.e., only a few of the matrices \({{\varGamma }}^{*}_{j,k,l}\) have nonzero columns) and some additional regularity conditions. Additionally, under sparsity of \(\boldsymbol{\alpha}^{g,*}_{j}\) (i.e., only a few of the vectors \(\alpha ^{g,*}_{j,k}\) are nonzero) and some additional regularity conditions, it can be shown that \(\mathcal {P}_{j}\left (\tilde {\boldsymbol {\alpha }}_{j}^{g} - \boldsymbol {\alpha }_{j}^{g,*} \right ) = O_{P}\left (\left \{\log (p)/n \right \}^{1/2}\right )\). Thus, the scaled remainder term (iii) is \(o_{P}(n^{-1/2})\) if \(n^{-1/2}\log (p) \to 0\). We refer readers to Chapter 8 of Bühlmann and van de Geer (2011) for a more comprehensive discussion of the assumptions required for consistency of the group LASSO.
We now express the de-biased group LASSO estimator for \(\alpha ^{g,*}_{j,k}\) as
$$ \check{\alpha}^{g}_{j,k} = \tilde{\alpha}^{g}_{j,k} + \left( n^{g}\right)^{-1} \tilde{C}^{-1}_{j,k} \left( \mathcal{V}^{g}_{k} - {\sum}_{l \neq j, k} \mathcal{V}_{l}^{g} \tilde{{{\varGamma}}}_{j,k,l} \right)^{\top} \left( \mathbf{X}^{g}_{j} - \mathcal{V}^{g}_{-j} \tilde{\boldsymbol{\alpha}}^{g}_{j} \right). $$
(A.5)
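A sketch of Eq. A.5 in code, assuming the group LASSO residual and the nodewise quantities are available (all argument names are ours):

```python
import numpy as np

def debias_alpha_jk(alpha_jk_tilde, C_jk, V_k, V_rest, Gamma_rest, resid, n_g):
    """Eq. A.5: de-biased estimate of the block alpha_{j,k}.

    V_k: n x d block for node k; V_rest, Gamma_rest: remaining blocks V_l
    and their nodewise coefficient matrices; resid: X_j - V_{-j} alpha_tilde."""
    Z = V_k - sum(V_l @ G for V_l, G in zip(V_rest, Gamma_rest))
    return alpha_jk_tilde + np.linalg.solve(C_jk, Z.T @ resid) / n_g
```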
We have established that the estimation error of \(\check {\alpha }^{g}_{j,k}\) satisfies
$$ \tilde{C}_{j,k} \left( \check{\alpha}^{g}_{j,k} - \alpha^{g,*}_{j,k}\right) = \left( n^{g}\right)^{-1} \left( \mathcal{V}^{g}_{k} - {\sum}_{l \neq j, k} \mathcal{V}_{l}^{g} {{\varGamma}}^{*}_{j,k,l} \right)^{\top} \left( \mathbf{X}^{g}_{j} - \mathcal{V}^{g}_{-j} \boldsymbol{\alpha}^{g,*}_{j} \right) + o_{P}(n^{-1/2}). $$
As stated above, the central limit theorem implies asymptotic normality of \(\check {\alpha }^{g}_{j,k}\).
We now construct an estimate for the variance of \(\check {\alpha }^{g}_{j,k}\). Suppose the residual \(\mathbf {X}^{g}_{j} - \mathcal {V}^{g}_{-j} \boldsymbol {\alpha }^{g,*}_{j}\) is independent of \(\mathcal {V}^{g}\), and let \({\tau _{j}^{g}}\) denote the common variance of its entries.
We can approximate the variance of \(\check {\alpha }^{g}_{j,k}\) as
$$ \check{{{\varOmega}}}^{g}_{j,k} = \left( n^{g}\right)^{-2}{\tau_{j}^{g}} \tilde{C}^{-1}_{j,k} \left( \mathcal{V}^{g}_{k} - {\sum}_{l \neq j, k} \mathcal{V}_{l}^{g} \tilde{{{\varGamma}}}_{j,k,l} \right)^{\top} \left( \mathcal{V}^{g}_{k} - {\sum}_{l \neq j, k} \mathcal{V}_{l}^{g} \tilde{{{\varGamma}}}_{j,k,l} \right) \left( \tilde{C}^{-1}_{j,k}\right)^{\top}. $$
(A.6)
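Given the nodewise residual matrix, Eq. A.6 is a small sandwich computation. A sketch, with `tau` treated as known for the moment (it is estimated below):

```python
import numpy as np

def variance_alpha_jk(tau, C_jk, Z, n_g):
    """Eq. A.6: sandwich variance estimate for the de-biased block, where
    Z = V_k - sum_l V_l Gamma_{j,k,l} is the nodewise residual matrix."""
    C_inv = np.linalg.inv(C_jk)
    return tau * C_inv @ (Z.T @ Z) @ C_inv.T / n_g**2
```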
As \({\tau _{j}^{g}}\) is typically unknown, we instead use the estimate
$$ \tilde{\tau}_{j}^{g} = \frac{\left\| \mathbf{X}^{g}_{j} - \mathcal{V}^{g}_{-j} \tilde{\boldsymbol{\alpha}}^{g}_{j} \right\|_{2}^{2}}{n^{g} - \widehat{df}}, $$
where \(\widehat {df}\) is an estimate of the degrees of freedom for the group LASSO estimate \(\tilde {\boldsymbol {\alpha }}_{j}^{g}\). In our implementation, we use the estimate proposed by Breheny and Huang (2009). Let \(\tilde {\alpha }^g_{j,k,l}\) be the l-th element of \(\tilde {\alpha }^g_{j,k}\), and let \(\mathcal {V}^g_{k,l}\) denote the l-th column of \(\mathcal {V}^g_k\). We then define
$$ \begin{array}{@{}rcl@{}} \bar{\alpha}^g_{j,k,l} = \frac{\langle \mathbf{X}^g_{j} - \mathcal{V}^g_{-j}\tilde{\boldsymbol{\alpha}}^g_j + \mathcal{V}^g_{k,l}\tilde{\alpha}^g_{j,k,l}, \mathcal{V}^g_{k,l}\rangle }{\langle \mathcal{V}^g_{k,l} , \mathcal{V}^g_{k,l} \rangle}, \end{array} $$
and estimate the degrees of freedom as
$$ \begin{array}{@{}rcl@{}} \widehat{df} = {\sum}_{k \neq j}{\sum}_{l=1}^{d} \frac{\tilde{\alpha}^{g}_{j,k,l}}{\bar{\alpha}^{g}_{j,k,l}}. \end{array} $$
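Putting the two displays together, a sketch of the residual variance estimate \(\tilde{\tau}_{j}^{g}\) (our own loop over the nonzero coordinates, which is equivalent because zero coordinates contribute zero to \(\widehat{df}\)):

```python
import numpy as np

def residual_variance(X_j, V, alpha_tilde, n_g):
    """Estimate tau via the Breheny-Huang degrees-of-freedom correction.

    V: n x (p-1)d design; alpha_tilde: fitted group LASSO coefficients."""
    resid = X_j - V @ alpha_tilde
    df = 0.0
    for c in np.flatnonzero(alpha_tilde):
        v_c = V[:, c]
        # univariate refit of coordinate c on its partial residual
        alpha_bar = (resid + v_c * alpha_tilde[c]) @ v_c / (v_c @ v_c)
        df += alpha_tilde[c] / alpha_bar
    return resid @ resid / (n_g - df)
```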
Appendix B: Generalized Score Matching Estimator
In this section, we establish consistency of the regularized score matching estimator and derive a bias-corrected estimator.
B.1 Form of Generalized Score Matching Loss
Below, we restate Theorem 3 of Yu et al. (2019), which provides conditions under which the score matching loss in Eq. 20 can be expressed as Eq. 21.
Theorem 1.
Assume the regularity conditions of Theorem 3 of Yu et al. (2019) hold; these comprise integrability and boundary conditions on h and the conditional density that justify the integration by parts underlying the score matching identity. Then Eqs. 20 and 21 are equivalent up to an additive constant that does not depend on h.
B.2 Generalized Score Matching Estimator in Low Dimensions
In this subsection, we provide an explicit form for the generalized score matching estimator in the low-dimensional setting and state its limiting distribution. We first introduce some additional notation that allows the generalized score matching loss to be written in a condensed form. Recall the form of the conditional density for the pairwise interaction model in Eq. 22. We define
$$ \begin{array}{@{}rcl@{}} &&\!\!\!\!\!\mathcal{V}^{g}_{j,k,1} = \begin{pmatrix} v_{j}^{1/2}\left( X^{g}_{1,j}\right)\dot{\psi}\left( X^{g}_{1,j}, X^{g}_{1,k}\right) \times \phi\left( {W^{g}_{1}}\right) \\ \vdots \\ v_{j}^{1/2}\left( X^{g}_{n^{g},j}\right) \dot{\psi}\left( X^{g}_{n^{g},j}, X^{g}_{n^{g},k}\right) \times \phi\left( W^{g}_{n^{g}}\right) \end{pmatrix}, \\ \\ &&\!\!\!\!\!\mathcal{V}^{g}_{j,2} = \begin{pmatrix} v_{j}^{1/2}\left( X^{g}_{1,j}\right) \times \left\{ \dot{\zeta}\left( X^{g}_{1,j}, \phi_{1}({W^{g}_{1}})\right),\ldots,\dot{\zeta}\left( X^{g}_{1,j}, \phi_{d}({W^{g}_{1}})\right) \right\} \\ \vdots \\ v_{j}^{1/2}\left( X^{g}_{n^{g},j}\right) \times \left\{ \dot{\zeta}\left( X^{g}_{n^{g},j}, \phi_{1}(W^{g}_{n^{g}})\right),\ldots,\dot{\zeta}\left( X^{g}_{n^{g},j}, \phi_{d}(W^{g}_{n^{g}})\right) \right\} \end{pmatrix},\\\\ &&\!\!\!\!\!\mathcal{U}^{g}_{j,k,1} = \begin{pmatrix} \left\{\dot{v}_{j}\left( X^{g}_{1,j}\right)\dot{\psi}\left( X^{g}_{1,j}, X^{g}_{1,k}\right) + v_{j}\left( X^{g}_{1,j}\right)\ddot{\psi}\left( X^{g}_{1,j}, X^{g}_{1,k}\right) \right\} \times \phi\left( {W^{g}_{1}}\right) \\ \vdots \\ \left\{\dot{v}_{j}\left( X^{g}_{n^{g},j}\right)\dot{\psi}\left( X^{g}_{n^{g},j}, X^{g}_{n^{g},k}\right) + v_{j}\left( X^{g}_{n^{g},j}\right)\ddot{\psi}\left( X^{g}_{n^{g},j}, X^{g}_{n^{g},k}\right) \right\} \times \phi\left( W^{g}_{n^{g}}\right) \end{pmatrix}, \\ \\ &&\!\!\!\!\!\mathcal{U}^{g}_{j,2} = \begin{pmatrix} v_{j}\left( X_{1,j}^{g}\right) \ddot{\zeta}\left( X^{g}_{1,j}, \phi_{1}({W^{g}_{1}})\right) & {\cdots} & v_{j}\left( X_{1,j}^{g}\right) \ddot{\zeta}\left( X^{g}_{1,j}, \phi_{d}({W^{g}_{1}})\right) \\ {\vdots} & {\ddots} & \vdots \\ v_{j}\left( X_{n^{g},j}^{g}\right) \ddot{\zeta}\left( X^{g}_{n^{g},j}, \phi_{1}(W^{g}_{n^{g}})\right) & {\cdots} & v_{j}\left( X_{n^{g},j}^{g}\right) \ddot{\zeta}\left( X^{g}_{n^{g},j}, \phi_{d}(W^{g}_{n^{g}})\right) \end{pmatrix} \\\\ &&\quad\quad\quad +\begin{pmatrix} \dot{v}_{j}\left( X_{1,j}^{g}\right) \dot{\zeta}\left( X^{g}_{1,j}, \phi_{1}({W^{g}_{1}})\right) & {\cdots} & \dot{v}_{j}\left( X_{1,j}^{g}\right) \dot{\zeta}\left( X^{g}_{1,j}, \phi_{d}({W^{g}_{1}})\right) \\ {\vdots} & {\ddots} & \vdots \\ \dot{v}_{j}\left( X_{n^{g},j}^{g}\right) \dot{\zeta}\left( X^{g}_{n^{g},j}, \phi_{1}(W^{g}_{n^{g}})\right) & {\cdots} & \dot{v}_{j}\left( X_{n^{g},j}^{g}\right) \dot{\zeta}\left( X^{g}_{n^{g},j}, \phi_{d}(W^{g}_{n^{g}})\right)\!\! \end{pmatrix}\!, \\ \\ &&\!\!\!\!\!\mathcal{V}^{g}_{j,1} = \left( \mathcal{V}^{g}_{j,1,1}, {\ldots}, \mathcal{V}^{g}_{j,p,1} \right); \quad \mathcal{U}^{g}_{j,1} = \left( \mathcal{U}^{g}_{j,1,1}, {\ldots}, \mathcal{U}^{g}_{j,p,1} \right). \end{array} $$
Let \(\boldsymbol {\alpha }_{j} = \left (\alpha _{j,1}^{\top }, \ldots ,\alpha _{j,p}^{\top }\right )^{\top }\) for d-dimensional vectors \(\alpha_{j,k}\), and let \(\boldsymbol{\theta}_{j} = (\theta_{j,1},\ldots ,\theta_{j,d})^{\top}\) for scalars \(\theta_{j,m}\). We can express the empirical score matching loss in Eq. 23 as
$$ L^{g}_{n,j}(\boldsymbol{\alpha}_{j}, \boldsymbol{\theta}_{j}) = \left( 2n^{g}\right)^{-1} \left( \mathcal{V}_{j,1}^{g} \boldsymbol{\alpha}_{j} + \mathcal{V}^{g}_{j,2} \boldsymbol{\theta}_{j} \right)^{\top} \left( \mathcal{V}_{j,1}^{g} \boldsymbol{\alpha}_{j}+ \mathcal{V}^{g}_{j,2} \boldsymbol{\theta}_{j} \right) + \left( n^{g}\right)^{-1}\mathbf{1}^{\top} \left( \mathcal{U}^{g}_{j,1} \boldsymbol{\alpha}_{j} + \mathcal{U}^{g}_{j,2} \boldsymbol{\theta}_{j} \right). $$
We write the gradient of the empirical loss as
$$ \nabla L^{g}_{n,j}(\boldsymbol{\alpha}_{j}, \boldsymbol{\theta}_{j}) = \left( n^{g}\right)^{-1} \begin{pmatrix} \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \\ \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \end{pmatrix} \begin{pmatrix} \boldsymbol{\alpha}_{j} \\ \boldsymbol{\theta}_{j} \end{pmatrix} + \left( n^{g}\right)^{-1} \begin{pmatrix} \left( \mathcal{U}_{j,1}^{g}\right)^{\top}\mathbf{1} \\ \left( \mathcal{U}_{j,2}^{g}\right)^{\top}\mathbf{1} \end{pmatrix}. $$
Thus, the minimizer \((\hat {\boldsymbol {\alpha }}^{g}_{j}, \hat {\boldsymbol {\theta }}^{g}_{j})\) of the empirical loss takes the form
$$ \begin{pmatrix} \hat{\boldsymbol{\alpha}}^{g}_{j} \\ \hat{\boldsymbol{\theta}}^{g}_{j} \end{pmatrix} = - \begin{pmatrix} \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \\ \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \end{pmatrix}^{-1} \begin{pmatrix} \left( \mathcal{U}_{j,1}^{g}\right)^{\top}\mathbf{1} \\ \left( \mathcal{U}_{j,2}^{g}\right)^{\top}\mathbf{1} \end{pmatrix}. $$
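In code, the low-dimensional estimator is a single linear solve. A sketch, assuming the design blocks have been constructed as above (the factor \((n^{g})^{-1}\) cancels):

```python
import numpy as np

def score_matching_fit(V1, V2, U1, U2):
    """Closed-form generalized score matching estimate in low dimensions.

    V1: n x pd, V2: n x d design blocks; U1, U2: the matching blocks of U.
    Solves (V^T V) x = -U^T 1 for x = (alpha_hat, theta_hat)."""
    V = np.hstack([V1, V2])
    U = np.hstack([U1, U2])
    sol = -np.linalg.solve(V.T @ V, U.sum(axis=0))
    return sol[:V1.shape[1]], sol[V1.shape[1]:]
```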
By applying Theorem 5.23 of van der Vaart (2000),
$$ \left( n^{g}\right)^{1/2} \begin{pmatrix} \hat{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}_{j}^{g,*} \\ \hat{\boldsymbol{\theta}}^{g}_{j} - \boldsymbol{\theta}_{j}^{g,*} \end{pmatrix} \to_{d} N\left( 0, A B A \right), $$
where A and B are the population analogues of the matrices \(\hat{A}\) and \(\hat{B}\) given below. We estimate the variance of \((\hat {\boldsymbol {\alpha }}^{g}_{j}, \hat {\boldsymbol {\theta }}^{g}_{j})\) as \(\hat {{{\varOmega }}}^{g}_{j} = \left (n^{g}\right )^{-1}\hat {A} \hat {B} \hat {A}\), where
$$ \begin{array}{@{}rcl@{}} &&\hat{A} = n^{g} \begin{pmatrix} \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \\ \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \end{pmatrix}^{-1}, \\ &&\hat{B} = \left( n^{g}\right)^{-1}\hat{\xi}^{\top}\hat{\xi}, \quad \hat{\xi} = \left( \text{diag}\left( \mathcal{V}_{j,1}^{g}\hat{\boldsymbol{\alpha}}^{g}_{j} + \mathcal{V}_{j,2}^{g} \hat{\boldsymbol{\theta}}^{g}_{j} \right)\mathcal{V}_{j,1}^{g} + \mathcal{U}_{j,1}^{g}, \quad \text{diag}\left( \mathcal{V}_{j,1}^{g}\hat{\boldsymbol{\alpha}}^{g}_{j} + \mathcal{V}_{j,2}^{g} \hat{\boldsymbol{\theta}}^{g}_{j} \right) \mathcal{V}_{j,2}^{g} + \mathcal{U}_{j,2}^{g} \right). \end{array} $$
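A sketch of the sandwich variance computation, mirroring the horizontal layout of \(\hat{\xi}\) above (function and argument names are ours):

```python
import numpy as np

def sandwich_variance(V1, V2, U1, U2, alpha_hat, theta_hat, n_g):
    """Estimated variance (n^g)^{-1} A_hat B_hat A_hat of the estimator."""
    V = np.hstack([V1, V2])
    U = np.hstack([U1, U2])
    A = n_g * np.linalg.inv(V.T @ V)
    fitted = V1 @ alpha_hat + V2 @ theta_hat      # score residuals, length n
    xi = fitted[:, None] * V + U                  # per-observation scores
    B = xi.T @ xi / n_g
    return A @ B @ A / n_g
```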
B.3 Consistency of Regularized Generalized Score Matching Estimator
In this subsection, we argue that the regularized generalized score matching estimators \(\tilde {\boldsymbol {\alpha }}^{g}_{j}\) and \(\tilde {\boldsymbol {\theta }}^{g}_{j}\) from Eq. 24 are consistent. Let \(\mathcal {P}_{j}(\boldsymbol {\alpha }_{j}) = {\sum }_{k \neq j} \|\alpha _{j,k}\|_{2}\). We establish convergence rates of \(\mathcal {P}_{j}\left (\tilde {\boldsymbol {\alpha }}_{j}^{g} - \boldsymbol {\alpha }_{j}^{g,*} \right )\) and \(\left \|\tilde {\boldsymbol {\theta }}^{g}_{j} - \boldsymbol {\theta }_{j}^{g,*} \right \|_{2}\). Our approach is based on proof techniques described in Bühlmann and van de Geer (2011).
Our result requires a notion of compatibility between the penalty function \(\mathcal {P}_{j}\) and the loss \(L^{g}_{n,j}\). Such notions are commonly assumed in the high-dimensional literature. Below, we define the compatibility condition.
Definition 1 (Compatibility Condition).
Let S ⊂ {1,…,p} be the set of indices of the nonzero blocks of \(\boldsymbol {\alpha }_{j}^{g,*}\), and let \(\bar {S}\) denote the complement of S. Let \(\iota_{S}\) be the (p − 1)d-dimensional vector whose r-th element is one if the r-th element of \(\boldsymbol{\alpha}_{j}\) belongs to a block in S, and zero otherwise. The group LASSO compatibility condition holds for the index set S and constant C > 0 if, for all \((\boldsymbol{\alpha}_{j}, \boldsymbol{\theta}_{j})\) satisfying \(\mathcal{P}_{j}\left(\iota_{\bar{S}} \circ \boldsymbol{\alpha}_{j}\right) + \|\boldsymbol{\theta}_{j}\|_{2} \leq 3\mathcal{P}_{j}\left(\iota_{S} \circ \boldsymbol{\alpha}_{j}\right)\),
$$ \mathcal{P}_{j}\left( \iota_{S} \circ \boldsymbol{\alpha}_{j} \right)^{2} \leq \frac{|S|}{C^{2}} \left( n^{g}\right)^{-1} \left\| \mathcal{V}^{g}_{j,1}\boldsymbol{\alpha}_{j} + \mathcal{V}^{g}_{j,2}\boldsymbol{\theta}_{j} \right\|_{2}^{2}, $$
where ∘ is the element-wise product operator.
Theorem 2.
Let \(\mathcal {E}\) be the set
$$ \begin{array}{@{}rcl@{}} \mathcal{E} &=& \left\{ \max_{k \neq j} \left\| \left( \mathcal{V}_{j,k,1}^{g}\right)^{\top} \left( \mathcal{V}_{j,1}^{g} \boldsymbol{\alpha}_{j}^{g,*} + \mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*} \right) + \left( \mathcal{U}^{g}_{j,k,1}\right)^{\top} \mathbf{1} \right\|_{2} \leq n^{g}\lambda_{0} \right\} \cap \\ &&\left\{ \left\| \left( \mathcal{V}_{j,2}^{g}\right)^{\top} \left( \mathcal{V}_{j,1}^{g} \boldsymbol{\alpha}_{j}^{g,*} + \mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*} \right) + \left( \mathcal{U}^{g}_{j,2}\right)^{\top} \mathbf{1} \right\|_{2} \leq n^{g}\lambda_{0} \right\} \end{array} $$
for some λ0 ≤ λ/2. Suppose the compatibility condition also holds. Then on the set \(\mathcal {E}\),
$$ \mathcal{P}_{j}\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}^{g,*}_{j} \right) + \left\| \tilde{\boldsymbol{\theta}}^{g}_{j} - \boldsymbol{\theta}_{j}^{g,*} \right\|_{2} \leq \frac{4\lambda |S|}{C^{2}}. $$
Proof of Theorem 2.
The regularized score matching estimators \(\tilde {\boldsymbol {\alpha }}_{j}^{g}\) and \(\tilde{\boldsymbol{\theta}}_{j}^{g}\) necessarily satisfy the following basic inequality:
$$ L^{g}_{n,j}\left( \tilde{\boldsymbol{\alpha}}^{g}_{j}, \tilde{\boldsymbol{\theta}}^{g}_{j}\right) + \lambda\mathcal{P}_{j}\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} \right) \leq L^{g}_{n,j}\left( \boldsymbol{\alpha}^{g,*}_{j}, \boldsymbol{\theta}^{g,*}_{j}\right) + \lambda\mathcal{P}_{j}\left( \boldsymbol{\alpha}^{g,*}_{j} \right). $$
With some algebra, this inequality can be rewritten as
$$ \begin{array}{@{}rcl@{}} &&\!\!\!\!\!\!\!\!\!\!\!\left( 2n^{g}\right)^{-1} \begin{pmatrix} \left( \tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}^{g,*}_{j} \right)^{\top} & \left( \tilde{\boldsymbol{\theta}}^{g}_{j} - \boldsymbol{\theta}^{g,*}_{j}\right)^{\top} \end{pmatrix} \begin{pmatrix} \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \\ \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \end{pmatrix}\\ &&\times\begin{pmatrix} \tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}^{g,*}_{j} \\ \tilde{\boldsymbol{\theta}}^{g}_{j} - \boldsymbol{\theta}^{g,*}_{j} \end{pmatrix} + \lambda\mathcal{P}_{j}\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} \right) \!\leq\! -\left( n^{g}\right)^{-1} \begin{pmatrix} \left( \tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}^{g,*}_{j} \right)^{\top} & \left( \tilde{\boldsymbol{\theta}}^{g}_{j} - \boldsymbol{\theta}^{g,*}_{j}\right)^{\top} \end{pmatrix}\\ &&\times\begin{pmatrix} \left( \mathcal{V}_{j,1}^{g}\right)^{\top} \left( \mathcal{V}_{j,1}^{g} \boldsymbol{\alpha}_{j}^{g,*} + \mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*} \right) + \left( \mathcal{U}^{g}_{j,1}\right)^{\top} \mathbf{1} \\ \left( \mathcal{V}_{j,2}^{g}\right)^{\top} \left( \mathcal{V}_{j,1}^{g} \boldsymbol{\alpha}_{j}^{g,*} + \mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*} \right) + \left( \mathcal{U}^{g}_{j,2}\right)^{\top} \mathbf{1} \end{pmatrix}\ + \lambda\mathcal{P}_{j}\left( \boldsymbol{\alpha}^{g,*}_{j} \right). \end{array} $$
By Lemma 1, on the set \(\mathcal {E}\) and using \(\lambda \geq 2\lambda_{0}\), we get
$$ \begin{array}{@{}rcl@{}} && \left( n^{g}\right)^{-1} \begin{pmatrix} \left( \tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}^{g,*}_{j} \right)^{\top} & \left( \tilde{\boldsymbol{\theta}}^{g}_{j} - \boldsymbol{\theta}^{g,*}_{j}\right)^{\top} \end{pmatrix} \begin{pmatrix} \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \\ \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \end{pmatrix}\\&& \times\begin{pmatrix} \tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}^{g,*}_{j} \\ \tilde{\boldsymbol{\theta}}^{g}_{j} - \boldsymbol{\theta}^{g,*}_{j} \end{pmatrix} + 2\lambda \mathcal{P}_{j}\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} \right) \leq \lambda\left\|\tilde{\boldsymbol{\theta}}^{g}_{j} - \boldsymbol{\theta}^{g,*}_{j} \right\|_{2} + 2\lambda \mathcal{P}_{j}\left( \boldsymbol{\alpha}^{g,*}_{j} \right) + \lambda\mathcal{P}_{j}\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}^{g,*}_{j} \right). \end{array} $$
On the left-hand side, we apply the triangle inequality to bound \(\mathcal{P}_{j}(\tilde{\boldsymbol{\alpha}}^{g}_{j})\) from below in terms of \(\mathcal{P}_{j}(\boldsymbol{\alpha}^{g,*}_{j})\), \(\mathcal{P}_{j}\left(\iota_{S} \circ (\tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}^{g,*}_{j})\right)\), and \(\mathcal{P}_{j}\left(\iota_{\bar{S}} \circ (\tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}^{g,*}_{j})\right)\). On the right-hand side, we decompose \(\mathcal{P}_{j}\left(\tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}^{g,*}_{j}\right)\) into its components on S and \(\bar{S}\). Combining the resulting bounds, applying the compatibility condition to \(\mathcal{P}_{j}\left(\iota_{S} \circ (\tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}^{g,*}_{j})\right)\), and using the fact that \(ab \leq a^{2} + b^{2}\) for any \(a, b \in \mathbb{R}\), we obtain the stated bound. □
If the event \(\mathcal {E}\) occurs with probability tending to one, Theorem 2 implies
$$ \mathcal{P}_{j}\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}^{g,*}_{j} \right) + \left\| \tilde{\boldsymbol{\theta}}^{g}_{j} - \boldsymbol{\theta}_{j}^{g,*} \right\|_{2} = O_{P}\left( \lambda\right). $$
We select λ so that the event \(\mathcal {E}\) occurs with high probability. For instance, suppose the elements of the matrix
$$ \begin{array}{@{}rcl@{}} \xi = \left( \text{diag}\left( \mathcal{V}_{j,1}^g\boldsymbol{\alpha}^{g,*}_j + \mathcal{V}_{j,2}^g \boldsymbol{\theta}^{g,*}_j \right)\mathcal{V}_{j,1}^g + \mathcal{U}_{j,1}^g, \quad \text{diag}\left( \mathcal{V}_{j,1}^g\boldsymbol{\alpha}^{g,*}_j + \mathcal{V}_{j,2}^g \boldsymbol{\theta}^{g,*}_j \right) \mathcal{V}_{j,2}^g + \mathcal{U}_{j,2}^g \right) \end{array} $$
are sub-Gaussian, and consider the event
$$ \begin{array}{@{}rcl@{}} \bar{\mathcal{E}} = \left\{ \left\| \begin{pmatrix} \left( \mathcal{V}_{j,1}^g\right)^{\top} \left( \mathcal{V}_{j,1}^g \boldsymbol{\alpha}_j^{g,*} +\mathcal{V}_{j,2}^g\boldsymbol{\theta}_j^{g,*} \right) + \left( \mathcal{U}^g_{j,1}\right)^{\top} \mathbf{1} \\ \left( \mathcal{V}_{j,2}^g\right)^{\top} \left( \mathcal{V}_{j,1}^g \boldsymbol{\alpha}_j^{g,*} +\mathcal{V}_{j,2}^g\boldsymbol{\theta}_j^{g,*} \right) + \left( \mathcal{U}^g_{j,2}\right)^{\top} \mathbf{1} \end{pmatrix} \right\|_{\infty} \leq \frac{n^{g}\lambda_{0}}{d} \right\}, \end{array} $$
where \(\|\cdot \|_{\infty }\) is the \(\ell _{\infty }\) norm. Observing that \(\bar{\mathcal {E}} \subset \mathcal {E}\), it suffices to show that \(\bar {\mathcal {E}}\) holds with high probability. It is shown in Corollary 2 of Negahban et al. (2012) that there exist constants u1,u2 > 0 such that with \(\lambda _{0} \asymp \{\log (p)/n\}^{1/2}\), \(\bar {\mathcal {E}}\) holds with probability at least \(1 - u_{1}p^{-u_{2}}\). Thus, \(\mathcal {E}\) occurs with probability tending to one as \(p \to \infty \). For distributions with heavier tails, a larger choice of λ may be required (Yu et al. 2019).
B.4 De-biased Score Matching Estimator
The KKT conditions for the regularized score matching loss imply that the estimators \(\tilde {\boldsymbol {\alpha }}^{g}_{j}\) and \(\tilde{\boldsymbol{\theta}}^{g}_{j}\) satisfy
$$ \begin{array}{@{}rcl@{}} \nabla L^{g}_{n,j}(\tilde{\boldsymbol{\alpha}}^{g}_{j}, \tilde{\boldsymbol{\theta}}^{g}_{j}) &=& \left( n^{g}\right)^{-1} \begin{pmatrix} \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \\ \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \end{pmatrix} \begin{pmatrix} \tilde{\boldsymbol{\alpha}}_{j}^{g} \\ \tilde{\boldsymbol{\theta}}_{j}^{g} \end{pmatrix}\\ &&+ \left( n^{g}\right)^{-1} \begin{pmatrix} \left( \mathcal{U}_{j,1}^{g}\right)^{\top}\mathbf{1} \\ \left( \mathcal{U}_{j,2}^{g}\right)^{\top}\mathbf{1} \end{pmatrix} = \begin{pmatrix} -\lambda \nabla \mathcal{P}_{j}\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} \right) \\ \mathbf{0} \end{pmatrix}. \end{array} $$
With some algebra, we can rewrite the KKT conditions as
$$ \begin{array}{@{}rcl@{}} &&\left( n^{g}\right)^{-1} \begin{pmatrix} \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \\ \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \end{pmatrix} \begin{pmatrix} \tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}_{j}^{g,*} \\ \tilde{\boldsymbol{\theta}}^{g}_{j} - \boldsymbol{\theta}_{j}^{g,*} \end{pmatrix} = \\ &&-\lambda \begin{pmatrix} \nabla \mathcal{P}_{j}\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} \right) \\ \mathbf{0} \end{pmatrix} - \left( n^{g}\right)^{-1} \begin{pmatrix} \left( \mathcal{V}_{j,1}^{g}\right)^{\top} \left( \mathcal{V}_{j,1}^{g} \boldsymbol{\alpha}_{j}^{g,*} + \mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*} \right) + \left( \mathcal{U}^{g}_{j,1}\right)^{\top} \mathbf{1} \\ \left( \mathcal{V}_{j,2}^{g}\right)^{\top} \left( \mathcal{V}_{j,1}^{g} \boldsymbol{\alpha}_{j}^{g,*} + \mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*} \right) + \left( \mathcal{U}^{g}_{j,2}\right)^{\top} \mathbf{1} \end{pmatrix}. \end{array} $$
Now, let Σj,n be the matrix
$$ {{\varSigma}}_{j,n} = \left( n^{g}\right)^{-1} \begin{pmatrix} \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \\ \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \end{pmatrix}, $$
let \({{\varSigma}}_{j} = E\left( {{\varSigma}}_{j,n} \right)\), and let \(\tilde {M}_{j}\) be an estimate of \({{\varSigma }}_{j}^{-1}\). We can now rewrite the KKT conditions as
$$ \begin{array}{@{}rcl@{}} \begin{pmatrix} \tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}_{j}^{g,*} \\ \tilde{\boldsymbol{\theta}}^{g}_{j} - \boldsymbol{\theta}_{j}^{g,*} \end{pmatrix} &=& \underset{(\mathrm{i})}{\underbrace{-\lambda \tilde{M}_{j} \begin{pmatrix} \nabla \mathcal{P}_{j}\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} \right) \\ \mathbf{0} \end{pmatrix}}} - \underset{(\text{ii})}{\underbrace{\left( n^{g}\right)^{-1} \tilde{M}_{j} \begin{pmatrix} \left( \mathcal{V}_{j,1}^{g}\right)^{\top} \left( \mathcal{V}_{j,1}^{g} \boldsymbol{\alpha}_{j}^{g,*} + \mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*} \right) + \left( \mathcal{U}^{g}_{j,1}\right)^{\top} \mathbf{1} \\ \left( \mathcal{V}_{j,2}^{g}\right)^{\top} \left( \mathcal{V}_{j,1}^{g} \boldsymbol{\alpha}_{j}^{g,*} + \mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*} \right) + \left( \mathcal{U}^{g}_{j,2}\right)^{\top} \mathbf{1} \end{pmatrix} }} + \\ &&\quad\quad\quad \underset{(\text{iii})}{ \underbrace{\left\{ I - \tilde{M}_{j} {{\varSigma}}_{j,n} \right\} \begin{pmatrix} \tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}_{j}^{g,*} \\ \tilde{\boldsymbol{\theta}}^{g}_{j} - \boldsymbol{\theta}_{j}^{g,*} \end{pmatrix} }}. \end{array} $$
(B.1)
As is the case for the de-biased group LASSO in Appendix A, the first term (i) in Eq. B.1 depends only on the observed data and can be directly subtracted from the initial estimate. The second term (ii) is asymptotically equivalent to
$$ \left( n^{g}\right)^{-1}{{\varSigma}}_{j}^{-1} \begin{pmatrix} \left( \mathcal{V}_{j,1}^{g}\right)^{\top} \left( \mathcal{V}_{j,1}^{g} \boldsymbol{\alpha}_{j}^{g,*} + \mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*} \right) + \left( \mathcal{U}^{g}_{j,1}\right)^{\top} \mathbf{1} \\ \left( \mathcal{V}_{j,2}^{g}\right)^{\top} \left( \mathcal{V}_{j,1}^{g} \boldsymbol{\alpha}_{j}^{g,*} + \mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*} \right) + \left( \mathcal{U}^{g}_{j,2}\right)^{\top} \mathbf{1} \end{pmatrix}, $$
(B.2)
if \(\tilde {M}_{j}\) is a consistent estimate of \({{\varSigma }}_{j}^{-1}\). Using the fact that the gradient of the population score matching loss vanishes at \((\boldsymbol{\alpha}_{j}^{g,*}, \boldsymbol{\theta}_{j}^{g,*})\), it can be seen that Eq. B.2 is an average of i.i.d. random quantities with mean zero. The central limit theorem then implies that any low-dimensional sub-vector is asymptotically normal. The last term (iii) is asymptotically negligible if \(\tilde {M}_{j}\) is an approximate inverse of Σj,n and if \((\tilde {\boldsymbol {\alpha }}_{j}^{g}, \tilde {\boldsymbol {\theta }}_{j}^{g})\) is consistent for \((\boldsymbol {\alpha }_{j}^{g,*}, \boldsymbol {\theta }_{j}^{g,*})\). Thus, for an appropriate choice of \(\tilde {M}_{j}\), we expect asymptotic normality of an estimator of the form
$$ \begin{pmatrix} \check{\boldsymbol{\alpha}}^{g}_{j} \\ \check{\boldsymbol{\theta}}^{g}_{j} \end{pmatrix} = \begin{pmatrix} \tilde{\boldsymbol{\alpha}}^{g}_{j} \\ \tilde{\boldsymbol{\theta}}^{g}_{j} \end{pmatrix} + \lambda \tilde{M}_{j} \begin{pmatrix} \nabla \mathcal{P}_{j}\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} \right) \\ \mathbf{0} \end{pmatrix}. $$
Before constructing \(\tilde {M}_{j}\), we first provide an alternative expression for \({{\varSigma }}_{j}^{-1}\). For each k ≠ j, we define the d × d matrices \({{\varGamma }}^{*}_{j,k,l}\), l ≠ j,k, and \({{\varDelta }}^{*}_{j,k}\) as the population coefficient matrices obtained by regressing \(\mathcal{V}^{g}_{j,k,1}\) on the remaining blocks \(\{\mathcal{V}^{g}_{j,l,1}\}_{l \neq j,k}\) and \(\mathcal{V}^{g}_{j,2}\). We analogously define the d × d matrices \({{\varLambda }}^{*}_{j,k}\) as the population coefficient matrices obtained by regressing \(\mathcal{V}^{g}_{j,2}\) on \(\{\mathcal{V}^{g}_{j,k,1}\}_{k \neq j}\). Additionally, we define the d × d matrices \(C^{*}_{j,k}\) and \(D^{*}_{j}\) as the population analogues of the estimates \(\tilde{C}_{j,k}\) and \(\tilde{D}_{j}\) given below.
It can be shown that \({{\varSigma }}_{j}^{-1}\) can be expressed as
$$ {{\varSigma}}^{-1}_{j} = \begin{pmatrix} \left( C^{*}_{j,1}\right)^{-1} & {\cdots} & \mathbf{0} & \mathbf{0} \\ {\vdots} & {\ddots} & {\vdots} & \vdots \\ \mathbf{0} & {\cdots} & \left( C^{*}_{j,p}\right)^{-1} & \mathbf{0} \\ \mathbf{0} & {\cdots} & \mathbf{0} & \left( D^{*}_{j}\right)^{-1} \end{pmatrix} \begin{pmatrix} I & -{{\varGamma}}^{*}_{j,1,2} & {\cdots} & -{{\varGamma}}^{*}_{j,1,p} & - {{\varDelta}}^{*}_{j,1} \\ -{{\varGamma}}^{*}_{j,2,1} & I & {\cdots} & -{{\varGamma}}^{*}_{j,2,p} & - {{\varDelta}}^{*}_{j,2} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} & \vdots \\ -{{\varGamma}}^{*}_{j,p,1} & -{{\varGamma}}^{*}_{j,p,2} & {\cdots} & I & - {{\varDelta}}^{*}_{j,p} \\ -{{\varLambda}}^{*}_{j,1} & -{{\varLambda}}^{*}_{j,2} & {\cdots} & -{{\varLambda}}^{*}_{j,p} & I \end{pmatrix} . $$
We can thus estimate \({{\varSigma }}_{j}^{-1}\) by estimating each of the matrices \({{\varGamma }}^{*}_{j,k,l}\), \({{\varLambda }}^{*}_{j,k}\), and \({{\varDelta }}^{*}_{j,k}\).
Similar to our discussion of the de-biased group LASSO in Appendix A, we use a group-penalized variant of the nodewise LASSO to construct \(\tilde {M}_{j}\). We obtain estimates \(\tilde{{{\varGamma}}}_{j,k,l}\) and \(\tilde{{{\varDelta}}}_{j,k}\) by a group LASSO regression of \(\mathcal{V}^{g}_{j,k,1}\) on \(\{\mathcal{V}^{g}_{j,l,1}\}_{l \neq j,k}\) and \(\mathcal{V}^{g}_{j,2}\), and estimates \(\tilde{{{\varLambda}}}_{j,k}\) by a group LASSO regression of \(\mathcal{V}^{g}_{j,2}\) on \(\{\mathcal{V}^{g}_{j,k,1}\}_{k \neq j}\), where ω1,ω2 > 0 are the respective tuning parameters and \(\|\cdot\|_{2,*}\) is as defined in Appendix A.
Additionally, we define the d × d matrices \(\tilde {C}_{j,k}\) and \(\tilde {D}_{j}\) as
$$ \begin{array}{@{}rcl@{}} &&\tilde{C}_{j,k} = \left( n^{g}\right)^{-1}\left( \mathcal{V}^{g}_{j,k,1}\right)^{\top} \left( \mathcal{V}_{j,k,1}^{g} - {\sum}_{l \neq k,j} \mathcal{V}_{j,l,1}^{g} \tilde{{{\varGamma}}}_{j,k,l} - \mathcal{V}^{g}_{j,2}\tilde{{{\varDelta}}}_{j,k} \right) \\ &&\tilde{D}_{j} = \left( n^{g}\right)^{-1}\left( \mathcal{V}^{g}_{j,2}\right)^{\top} \left( \mathcal{V}_{j,2}^{g} - {\sum}_{k \neq j} \mathcal{V}_{j,k,1}^{g} \tilde{{{\varLambda}}}_{j,k} \right). \end{array} $$
We then take \(\tilde {M}_{j}\) as
$$ \tilde{M}_{j} = \begin{pmatrix} \tilde{C}^{-1}_{j,1} & {\cdots} & \mathbf{0} & \mathbf{0} \\ {\vdots} & {\ddots} & {\vdots} & \vdots \\ \mathbf{0} & {\cdots} & \tilde{C}^{-1}_{j,p} & \mathbf{0} \\ \mathbf{0} & {\cdots} & \mathbf{0} & \tilde{D}^{-1}_{j} \end{pmatrix} \begin{pmatrix} I & -\tilde{{{\varGamma}}}_{j,1,2} & {\cdots} & -\tilde{{{\varGamma}}}_{j,1,p} & - \tilde{{{\varDelta}}}_{j,1} \\ -\tilde{{{\varGamma}}}_{j,2,1} & I & {\cdots} & -\tilde{{{\varGamma}}}_{j,2,p} & - \tilde{{{\varDelta}}}_{j,2} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} & \vdots \\ -\tilde{{{\varGamma}}}_{j,p,1} & -\tilde{{{\varGamma}}}_{j,p,2} & {\cdots} & I & - \tilde{{{\varDelta}}}_{j,p} \\ -\tilde{{{\varLambda}}}_{j,1} & -\tilde{{{\varLambda}}}_{j,2} & {\cdots} & -\tilde{{{\varLambda}}}_{j,p} & I \end{pmatrix} . $$
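Assembling \(\tilde{M}_{j}\) mirrors the Appendix A construction with an extra row and column of blocks. A sketch with zero-based indexing over generic blocks (in practice the blocks with k = j are omitted; all names are ours):

```python
import numpy as np

def assemble_M_b(C_blocks, D, Gamma, Delta, Lambda):
    """Assemble M_tilde for the score matching loss.

    C_blocks: list of p (d x d) matrices C_{j,k}; D: d x d matrix D_j.
    Gamma: dict (k, l) -> d x d matrix; Delta, Lambda: lists of p (d x d)
    matrices, all hypothetical precomputed nodewise estimates."""
    p = len(C_blocks)
    d = D.shape[0]
    m = p * d + d
    right = np.eye(m)
    for k in range(p):
        for l in range(p):
            if l != k:
                right[k*d:(k+1)*d, l*d:(l+1)*d] = -Gamma[(k, l)]
        right[k*d:(k+1)*d, p*d:] = -Delta[k]
        right[p*d:, k*d:(k+1)*d] = -Lambda[k]
    left = np.zeros((m, m))
    for k in range(p):
        left[k*d:(k+1)*d, k*d:(k+1)*d] = np.linalg.inv(C_blocks[k])
    left[p*d:, p*d:] = np.linalg.inv(D)
    return left @ right
```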
When \({{\varGamma }}^{*}_{j,k,l}\), \({{\varDelta }}^{*}_{j,k}\), and \({{\varLambda }}^{*}_{j,k}\) satisfy appropriate sparsity conditions and some additional regularity assumptions, \(\tilde {M}_{j}\) is a consistent estimate of \({{\varSigma }}_{j}^{-1}\) for \(\omega _{1} \asymp \{\log (p)/n\}^{1/2}\) and \(\omega _{2} \asymp \{\log (p)/n\}^{1/2}\) (see, e.g., Chapter 8 of Bühlmann and van de Geer (2011) for a more comprehensive discussion). Using the same argument presented in Appendix A, we obtain the following bound on a scaled version of the remainder term (iii):
$$ \begin{array}{@{}rcl@{}} &&\!\!\!\!\!\left\| \begin{pmatrix} \tilde{C}_{j,1} & {\cdots} & \mathbf{0} & \mathbf{0} \\ {\vdots} & {\ddots} & {\vdots} & \vdots \\ \mathbf{0} & {\cdots} & \tilde{C}_{j,p} & \mathbf{0} \\ \mathbf{0} & {\cdots} & \mathbf{0} & \tilde{D}_{j} \end{pmatrix} \left\{ I - \tilde{M}_{j} {{\varSigma}}_{j,n} \right\} \begin{pmatrix} \tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}_{j}^{g,*} \\ \tilde{\boldsymbol{\theta}}^{g}_{j} - \boldsymbol{\theta}_{j}^{g,*} \end{pmatrix} \right\|_{\infty} \leq \\ &&\max\{\omega_{1}, \omega_{2} \} \left\{ \mathcal{P}_{j}\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}^{g,*}_{j} \right) + \left\| \tilde{\boldsymbol{\theta}}^{g}_{j} - \boldsymbol{\theta}_{j}^{g,*} \right\|_{2} \right\}. \end{array} $$
The remainder is \(o_{P}(n^{-1/2})\), and hence asymptotically negligible, if \(n^{1/2}\max \{\omega _{1}, \omega _{2}\} \lambda \to 0\), where λ is the tuning parameter for the regularized score matching estimator (see Theorem 2).
The de-biased estimate \(\check {\alpha }^{g}_{j,k}\) of \(\alpha ^{g,*}_{j,k}\) can be expressed as
$$ \begin{array}{@{}rcl@{}} \check{\alpha}^{g}_{j,k} &=& \tilde{\alpha}^{g}_{j,k} - \left( n^{g}\right)^{-1} \tilde{C}^{-1}_{j,k} \left\{ \left( \mathcal{V}^{g}_{j,k,1} - {\sum}_{l \neq j, k} \mathcal{V}_{j,l,1}^{g} \tilde{{{\varGamma}}}_{j,k,l} \right)^{\top} \left( \mathcal{V}^{g}_{j,1} \tilde{\boldsymbol{\alpha}}^{g}_{j} + \mathcal{V}^{g}_{j,2} \tilde{\boldsymbol{\theta}}_{j}^{g} \right) + \left( \mathcal{U}_{j,k,1}^{g}\right)^{\top} \mathbf{1} \right\} \\ &&+ \left( n^{g}\right)^{-1} \tilde{C}^{-1}_{j,k} \left\{ \left( \mathcal{V}^{g}_{j,2} \tilde{{{\varDelta}}}_{j,k} \right)^{\top} \left( \mathcal{V}^{g}_{j,1} \tilde{\boldsymbol{\alpha}}^{g}_{j} + \mathcal{V}^{g}_{j,2} \tilde{\boldsymbol{\theta}}_{j}^{g} \right) + \tilde{{{\varDelta}}}_{j,k}^{\top} \left( \mathcal{U}_{j,2}^{g}\right)^{\top} \mathbf{1} \right\}. \end{array} $$
(B.4)
The difference between the de-biased estimator \(\check {\alpha }^{g}_{j,k}\) and the true parameter \(\alpha ^{g,*}_{j,k}\) can be expressed as
$$ \begin{array}{@{}rcl@{}} \tilde{C}_{j,k}\left( \check{\alpha}^{g}_{j,k} - \alpha^{g,*}_{j,k}\right) &=& -\left( n^{g}\right)^{-1} \left\{ \left( \mathcal{V}^{g}_{j,k,1} - {\sum}_{l \neq j, k} \mathcal{V}_{j,l,1}^{g} {{\varGamma}}^{*}_{j,k,l} \right)^{\top} \left( \mathcal{V}^{g}_{j,1} \boldsymbol{\alpha}^{g,*}_{j} + \mathcal{V}^{g}_{j,2} \boldsymbol{\theta}_{j}^{g,*} \right) + \left( \mathcal{U}_{j,k,1}^{g}\right)^{\top} \mathbf{1} \right\} \\ &&+ \left( n^{g}\right)^{-1} \left\{ \left( \mathcal{V}^{g}_{j,2} {{\varDelta}}^{*}_{j,k}\right)^{\top} \left( \mathcal{V}^{g}_{j,1} \boldsymbol{\alpha}^{g,*}_{j} + \mathcal{V}^{g}_{j,2} \boldsymbol{\theta}_{j}^{g,*} \right) + \left( {{\varDelta}}^{*}_{j,k}\right)^{\top} \left( \mathcal{U}_{j,2}^{g}\right)^{\top} \mathbf{1} \right\} + o_{P}\left( n^{-1/2}\right). \end{array} $$
As discussed above, the central limit theorem implies asymptotic normality of \(\check {\alpha }^{g}_{j,k}\). We can estimate the asymptotic variance of \(\check {\alpha }^{g}_{j,k}\) as
$$ \left( n^{g}\right)^{-2}\tilde{C}_{j,k}^{-1}\tilde{M}_{j,k}\tilde{\xi}^{\top}\tilde{\xi}\tilde{M}^{\top}_{j,k} \left( \tilde{C}_{j,k}^{-1}\right)^{\top}, $$
where we define
$$ \begin{array}{@{}rcl@{}} \tilde{\xi} &=& \left( \text{diag}\left( \mathcal{V}_{j,1}^{g} \tilde{\boldsymbol{\alpha}}_{j}^{g} + \mathcal{V}_{j,2}^{g}\tilde{\boldsymbol{\theta}}_{j}^{g} \right)\mathcal{V}_{j,1}^{g} + \mathcal{U}^{g}_{j,1}, \quad \text{diag}\left( \mathcal{V}_{j,1}^{g} \tilde{\boldsymbol{\alpha}}_{j}^{g} + \mathcal{V}_{j,2}^{g}\tilde{\boldsymbol{\theta}}_{j}^{g} \right)\mathcal{V}_{j,2}^{g} + \mathcal{U}^{g}_{j,2} \right), \\ \tilde{M}_{j,k} &=& \begin{pmatrix} -\tilde{{{\varGamma}}}_{j,k,1} & {\cdots} & -\tilde{{{\varGamma}}}_{j,k,k-1} & I & -\tilde{{{\varGamma}}}_{j,k,k+1} & {\cdots} & -\tilde{{{\varGamma}}}_{j,k,p} & - \tilde{{{\varDelta}}}_{j,k} \end{pmatrix}. \end{array} $$
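Finally, a sketch of this variance computation for \(\check{\alpha}^{g}_{j,k}\), assuming \(\tilde{\xi}\) is stored as an n × (pd + d) matrix as above (names are ours):

```python
import numpy as np

def debiased_variance_jk(C_jk, M_jk, xi, n_g):
    """Sandwich variance of the de-biased block estimate in Appendix B.

    C_jk: d x d; M_jk: d x (pd + d) row of blocks; xi: n x (pd + d)."""
    C_inv = np.linalg.inv(C_jk)
    return C_inv @ M_jk @ (xi.T @ xi) @ M_jk.T @ C_inv.T / n_g**2
```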