
Covariate-Adjusted Inference for Differential Analysis of High-Dimensional Networks

Abstract

Differences between biological networks corresponding to disease conditions can help delineate the underlying disease mechanisms. Existing methods for differential network analysis do not account for dependence of networks on covariates. As a result, these approaches may detect spurious differential connections induced by the effect of the covariates on both the disease condition and the network. To address this issue, we propose a general covariate-adjusted test for differential network analysis. Our method assesses differential network connectivity by testing the null hypothesis that the network is the same for individuals who have identical covariates and only differ in disease status. We show empirically in a simulation study that the covariate-adjusted test exhibits improved type-I error control compared with naïve hypothesis testing procedures that do not account for covariates. We additionally show that there are settings in which our proposed methodology provides improved power to detect differential connections. We illustrate our method by applying it to detect differences in breast cancer gene co-expression networks by subtype.

Introduction

Complex diseases are often associated with aberrations in biological networks, such as gene regulatory networks and brain functional or structural connectivity networks (Barabási et al. 2011). Performing differential network analysis, or identifying connections in biological networks that change with disease condition, can provide insights into the disease mechanisms and lead to the identification of network-based biomarkers (Ideker and Krogan, 2012; de la Fuente, 2010).

Probabilistic graphical models are commonly used to summarize the conditional independence structure of a set of nodes in a biological network. A common approach to differential network analysis is to first estimate the graph corresponding to each disease condition and then assess between-condition differences in the graph. For instance, when using Gaussian graphical models, one can learn the network by estimating the inverse covariance matrix using the graphical LASSO (Friedman et al. 2008); one can then identify changes in the inverse covariance matrix associated with disease condition (Zhao et al. 2014; Xia et al. 2015; He et al. 2019). Alternatively, the condition-specific networks can be estimated using neighborhood selection (Meinshausen and Bühlmann, 2006); in this approach, partial correlations among nodes are estimated by fitting a series of linear regressions in which one node is treated as the outcome, and the remaining nodes are treated as regressors. Changes in the network can then be delineated from differences in the regression coefficients by disease condition (Belilovsky et al. 2016; Xia et al. 2018). More generally, the condition-specific networks are often modeled using exponential family pairwise interaction models (Lin et al. 2016; Yang et al. 2015; Yu et al. 2019; Yu et al. 2020).

The approaches to differential network analysis described above may lead to the detection of between-group differences in biological networks that are not necessarily meaningful, in particular, when the condition-specific networks depend on covariates (e.g., age and sex). This is because between-group network differences can be induced by confounding variables, i.e., variables that are associated with both the within-group networks, and the disease condition. In such cases, the network differences by disease condition may only reflect the association between the confounding variable and the disease. It is therefore important to account for the relationship between covariates and biological networks when performing differential network analysis.

In this paper, we propose a two-sample test for differential network analysis that accounts for within-group dependence of the networks on covariates. More specifically, we propose to perform covariate-adjusted inference using a class of pairwise interaction models for the within-group networks. Our approach treats each condition-specific network as a function of the covariates. It then performs a hypothesis test for equivalence of these functions. To accommodate the high-dimensional setting, in which the number of nodes in the network is large relative to the number of samples collected, we propose to estimate the networks using a regularized estimator and to perform hypothesis testing using a bias-corrected version of the regularized estimate (van de Geer, 2016).

Our proposal is related to existing literature on modeling networks as functions of a small number of variables. For example, there are various proposals for estimating high-dimensional inverse covariance matrices, conditional upon continuous low-dimensional features (Zhou et al. 2010; Wang and Kolar, 2014). Also related are methods for regularized estimation of high-dimensional varying coefficient models, wherein the regression coefficients are functions of a small number of covariates (Wang and Xia, 2009). Our method is similar but places a particular emphasis on hypothesis testing in order to assess the statistical significance of observed changes in the network. Our approach lays the foundation for a general class of graphical models and is the first, to the best of our knowledge, to perform covariate-adjusted hypothesis tests for differential network analysis.

The rest of the paper is organized as follows. In Section 2, we begin with a broad overview of our proposed framework for covariate-adjusted differential network analysis in pairwise interaction exponential family models and introduce some working examples. In the following sections, we specialize our framework by considering two different approaches for estimation and inference: In Section 3, we describe a method that uses neighborhood selection (Meinshausen and Bühlmann, 2006; Chen et al. 2015; Yang et al. 2015), and in Section 4, we discuss an alternative estimation approach that utilizes the score matching framework of Hyvärinen (2005, 2007). We assess the performance of our proposed methodology on synthetic data in Section 5 and apply it to a breast cancer data set from The Cancer Genome Atlas (TCGA) (Weinstein et al. 2013) in Section 6. We conclude with a brief discussion in Section 7.

Overview of the Proposed Framework

Differential Network Analysis without Covariate Adjustment

To formalize our problem, we begin by introducing some notation. We compare networks between two groups, labeled by g ∈{I,II}. We obtain measurements of p variables \(X^{g} = \left ({X^{g}_{1}},\ldots ,{X_{p}^{g}}\right )^{\top }\), corresponding to nodes in a graphical model (Maathuis et al. 2018), on nI subjects in group I and nII subjects in group II. We define \(\mathcal {X}\) as the sample space of Xg. Let \(X^{g}_{i,j}\) denote the data for node j for subject i in group g, and let \(\mathbf {X}_{j}^{g} = (X^{g}_{1,j}, \ldots , X^{g}_{n^{g},j})^{\top }\) be an ng-dimensional vector of measurements on node j for group g.

Our objective is to determine whether the association between variables Xj and Xk, conditional upon all other variables, differs by group. Our approach is to specify a model for Xg such that the conditional dependence between any two nodes \({X^{g}_{j}}\) and \({X^{g}_{k}}\) can be represented by a single scalar parameter \(\beta ^{g,*}_{j,k}\). If the association between nodes j and k is the same in both groups I and II, \(\beta ^{\mathrm {I},*}_{j,k} = \beta ^{\text {II},*}_{j,k}\). Conversely, if \(\beta ^{\mathrm {I},*}_{j,k} \neq \beta ^{\text {II},*}_{j,k}\), we say nodes j and k are differentially connected. We assess for differential connectivity by performing a test of the null hypothesis

$$ H^{0}_{j,k}: \beta^{\mathrm{I},*}_{j,k} =\beta^{\text{II},*}_{j,k}. $$
(1)

We consider a general class of exponential family pairwise interaction models. For x = (x1,…,xp), we assume the density function for Xg takes the form

$$ f^{g,*}(x) = \exp\left( {\sum}_{j=1}^{p} \mu_{j}(x_{j}) + {\sum}_{j = 1}^{p} {\sum}_{k = 1}^{j} \beta^{g,*}_{j,k}\psi_{j,k}(x_{j}, x_{k}) - U\left( \boldsymbol{\beta}^{g,*} \right) \right), $$
(2)

where ψj,k and μj are fixed and known functions, βg,∗ is a p × p matrix with elements \(\beta ^{g,*}_{j,k}\), and U(βg,∗) is the log-partition function. The dependence between \({X^{g}_{j}}\) and \({X^{g}_{k}}\) is measured by \(\beta ^{g,*}_{j,k}\), and nodes j and k are conditionally independent in group g if and only if \(\beta ^{g,*}_{j,k} = 0\).

This class of exponential family distributions is rich and includes several models that have been studied previously in the graphical modeling literature. One such example is the Gaussian graphical model, perhaps the most widely-used graphical model for continuous data. For \(x \in \mathbb {R}^{p}\), the density function for mean-centered Gaussian random vectors can be expressed as

$$ f^{g,*}(x) \propto \exp\left( -{\sum}_{j = 1}^{p} {\sum}_{k = 1}^{j} \beta^{g,*}_{j,k} x_{j}x_{k} \right), $$
(3)

and is thus a special case of Eq. 2 with ψj,k = −xjxk and μj = 0. The non-negative Gaussian density, which takes the form of Eq. 3 with the constraint that x takes values in \([0, \infty )^{p}\), also belongs to the exponential family class. Another canonical example is the Ising model, commonly used for studying conditional dependencies among binary random variables. For x ∈{0,1}p, the density function for the Ising model can be expressed as

$$ f(x) \propto \exp\left( {\sum}_{j = 1}^{p} {\sum}_{k=1}^{j} \beta_{j,k} x_{j}x_{k}\right). $$

Additional examples include the Poisson model, the exponential graphical model, and conditionally-specified mixed graphical models (Yang et al. 2015; Chen et al. 2015).

When asymptotically normal estimates of \(\beta ^{\mathrm {I},*}_{j,k}\) and \(\beta ^{\text {II},*}_{j,k}\) are available, one can perform a calibrated test of \(H^{0}_{j,k}\) based on the difference between the estimates. In many cases, asymptotically normal estimates can be obtained using well-established methodology. For instance, when the log-partition function U(βg,∗) is available in closed form and is tractable, one can obtain estimates via (penalized) maximum likelihood. This is a standard approach in the Gaussian setting, in which case the log-partition function is easy to compute. However, this is not the case for other exponential family models. Likelihood-based estimation strategies are thus generally difficult to implement. In this paper, we consider two alternative strategies that have been proposed to overcome these computational challenges and are more broadly applicable.

The first approach we discuss is neighborhood selection (Chen et al. 2015; Meinshausen and Bühlmann, 2006; Yang et al. 2015). Consider a sub-class of exponential family graphical models for which the conditional density function for any node \({X^{g}_{j}}\) given the remaining nodes belongs to a univariate exponential family model. Because the log-partition function in univariate exponential family models is available in closed form, it is computationally feasible to estimate each conditional density function. By estimating the conditional density functions, one can identify the neighbors of node j, that is, the nodes upon which the conditional distribution depends. This approach was first proposed as an alternative to maximum likelihood estimation for estimating Gaussian graphical models (Meinshausen and Bühlmann, 2006). To describe our approach, we focus on the Gaussian case, though this approach is more widely applicable and can be used for modeling dependencies among, e.g., Poisson, binomial, and exponential random variables as well (Chen et al. 2015; Yang et al. 2015).

In Gaussian graphical models, the dependency of node j on all other nodes can be determined based on the linear model

$$ {X^{g}_{j}} = \beta^{g,*}_{j,0} + {\sum}_{k \neq j} \beta^{g,*}_{j,k} {X^{g}_{k}} + {\epsilon^{g}_{j}}. $$
(4)

The regression coefficients \(\beta ^{g,*}_{j,k}\) measure the strength of linear association between nodes j and k conditional upon all other nodes and are zero if and only if nodes j and k are conditionally independent; \(\beta ^{g,*}_{j,0}\) is an intercept term and is zero if all nodes are mean-centered. (We acknowledge a slight abuse of notation here, as the regression coefficients in Eq. 4 are not equivalent to parameters in Eq. 2. However, either estimand fully characterizes conditional independence). In the low-dimensional setting (i.e., \(p \ll n^{g}\)), statistically efficient and asymptotically normal estimates of the regression coefficients can be readily obtained via ordinary least squares. In high dimensions (i.e., \(p > n^{g}\)), the ordinary least squares estimates are not well-defined, so to obtain consistent estimates we typically rely upon regularized estimators such as the LASSO and the elastic net (Tibshirani, 1996; Zou and Hastie, 2005). Regularized estimators are generally biased and have intractable sampling distributions, and as such, are unsuitable for performing formal statistical inference. However, several methods have recently emerged for obtaining asymptotically normal estimates by correcting the bias of regularized estimators (Javanmard and Montanari, 2014; van de Geer et al. 2014; Zhang and Zhang, 2014).
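As a brief illustration of this node-wise regression step (a minimal sketch on simulated placeholder data, not the authors' implementation), the low-dimensional Gaussian case reduces to a sequence of ordinary least squares fits; in high dimensions, the lm() call below would be replaced by a regularized regression.

```r
## Sketch: neighborhood regression for one group in the low-dimensional Gaussian case.
set.seed(1)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), n, p)   # placeholder node measurements for one group
j <- 1
fit <- lm(X[, j] ~ X[, -j])       # regress node j on all remaining nodes
beta_hat <- coef(fit)[-1]         # estimated neighborhood coefficients for node j
tau_hat  <- diag(vcov(fit))[-1]   # their estimated sampling variances
```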

The second computationally efficient approach we consider is to estimate the density function using the score matching framework of Hyvärinen (2005, 2007). Hyvärinen derives a loss function for estimation of density functions for continuous random variables that is based on the gradient of the log-density with respect to the observations. As such, the score matching loss does not depend on the log-partition function in exponential family models. Moreover, when the joint distribution for Xg belongs to an exponential family model, the loss is quadratic in the unknown parameters, allowing for efficient computation. In low dimensions, the minimizer of the score matching loss is consistent and asymptotically normal. In high dimensions, one can obtain asymptotically normal estimates by minimizing a regularized version of the score matching loss to obtain an initial estimate (Lin et al. 2016; Yu et al. 2019) and subsequently correcting for the bias induced by regularization (Yu et al. 2020).

Covariate-Adjusted Differential Network Analysis

We now consider the setting in which the within-group networks depend on covariates. We denote by Wg a q-dimensional random vector of covariate measurements for group g, and we define \(\mathcal {W}\) as the sample space of Wg. Let \(W^{g}_{i,r}\) refer to the value of covariate r for subject i in group g, and let \({W^{g}_{i}} = (W^{g}_{i,1},\ldots ,W^{g}_{i,q})^{\top }\) be a q-dimensional vector containing all covariates for subject i in group g. We assume the number of covariates is small relative to the sample size (i.e., \(q \ll n^{g}\)).

To study the dependence of the within-group networks on the covariates, we specify a model for the nodes Xg given the covariates Wg that allows for the inter-node dependencies to vary as a function of Wg. The model defines a function \(\eta ^{g,*}_{j,k} : \mathcal {W} \to \mathbb {R}\) that takes as input a vector of covariates and returns a measure of association between nodes j and k for subjects in group g with those covariates. One can interpret \(\eta ^{g,*}_{j,k}\) as a conditional version of \(\beta ^{g,*}_{j,k}\), given the covariates.

We assume that \(\eta ^{g,*}_{j,k}\) can be written as a low-dimensional linear basis expansion in Wg of dimension d — that is,

$$ \eta^{g,*}_{j,k}\left( W^{g}\right) = \left\langle \phi\left( W^{g}\right), \alpha_{j,k}^{g,*} \right\rangle, $$
(5)

where \(\phi : \mathcal {W} \to \mathbb {R}^{d}\) is a map from a set of covariates to its expansion, \(\alpha _{j,k}^{g,*}\) is a d-dimensional vector, and 〈⋅,⋅〉 denotes the vector inner product. Let ϕc(w) refer to the c-th element of ϕ(w). One can take the simple approach of specifying ϕ as a linear basis, \(\phi (w) = \left (1, w_{1}, \ldots , w_{q}\right )\) for \(w \in \mathcal {W}\), though more flexible choices such as polynomial or B-spline bases can also be considered. It may be preferable to specify ϕ so that \(\eta ^{g,*}_{j,k}\) is an additive function of the covariates. This allows one to easily assess the effect of any specific covariate on the network by estimating the sub-vector of \(\alpha ^{g,*}_{j,k}\) that is relevant to the covariate of interest.
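For concreteness, a small illustrative helper (ours, not part of the paper's code) mapping a covariate vector to a linear or an additive cubic polynomial expansion could look as follows.

```r
## Sketch: two simple basis expansions phi(w) for a covariate vector w.
phi_linear <- function(w) c(1, w)                 # (1, w_1, ..., w_q); d = q + 1
phi_cubic  <- function(w) {                       # additive cubic polynomial; d = 3q + 1
  c(1, unlist(lapply(w, function(wr) c(wr, wr^2, wr^3))))
}

phi_linear(c(0.2, -0.5))   # length 3
phi_cubic(c(0.2, -0.5))    # length 7
```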

When the association between nodes j and k does not depend on group membership, \(\eta ^{\mathrm {I,*}}_{j,k}(w) = \eta ^{\mathrm {II,*}}_{j,k}(w)\) for all w, and \(\alpha ^{\mathrm {I,*}}_{j,k} = \alpha ^{\mathrm {II,*}}_{j,k}\). In other words, if one subject from group I and another subject from group II have identically-valued covariates, the corresponding measure of association between nodes j and k is also the same. In the covariate-adjusted setting, we say that nodes j and k are differentially connected if there exists w such that \(\eta ^{\mathrm {I,*}}_{j,k}(w) \neq \eta ^{\mathrm {II,*}}_{j,k}(w)\), or equivalently, if \(\alpha ^{\mathrm {I,*}}_{j,k} \neq \alpha ^{\mathrm {II,*}}_{j,k}\). We can thus assess differential connectivity between nodes j and k by testing the null hypothesis

$$ G^{0}_{j,k}: \alpha^{\mathrm{I,*}}_{j,k} = \alpha^{\mathrm{II,*}}_{j,k}. $$
(6)

Similar to the unadjusted setting, when asymptotically normal estimates of \(\alpha ^{\mathrm {I},*}_{j,k}\) and \(\alpha ^{\text {II},*}_{j,k}\) are available, a calibrated test can be constructed based on the difference between the estimates.

We now specify a form for the conditional distribution of Xg given Wg as a generalization of the exponential family pairwise interaction model Eq. 2. We assume the conditional density for Xg given Wg can be expressed as

$$ f^{g,*}(x|w) \propto \exp\left( {\sum}_{j=1}^{p} \mu_{j}(x_{j}) + {\sum}_{j = 1}^{p} {\sum}_{k=1}^{j} \eta^{g,*}_{j,k}(w)\psi_{j,k}(x_{j}, x_{k}) + {\sum}_{j=1}^{p} {\sum}_{c=1}^{d} \theta_{j,c}^{g,*} \zeta_{j,c}\left( x_{j}, \phi_{c}(w)\right) \right), $$
(7)

where w = (w1,…,wq), and the proportionality is up to a normalizing constant that does not depend on x. Above, ζj,c is a fixed and known function, and the main effects of the covariates on Xg are represented by the scalar parameters \(\theta ^{g,*}_{j,c}\). The conditional dependence between nodes j and k, given all other nodes and given that Wg = w, is quantified by \(\eta ^{g,*}_{j,k}(w)\), and \(\eta ^{g,*}_{j,k}(w) = 0\) if and only if nodes j and k are conditionally independent at w. One can thus view \(\eta ^{g,*}_{j,k}\) in Eq. 7 as a conditional version of \(\beta ^{g,*}_{j,k}\) in Eq. 2.

Either of the estimation strategies introduced in Section 2.1 can be used to perform covariate-adjusted inference. When the conditional distribution of each node given the remaining nodes and the covariates belongs to a univariate exponential family model, the covariate-dependent network can be estimated using neighborhood selection because the node conditional distributions can be estimated efficiently with likelihood-based methods. Alternatively, we can estimate the conditional density function in Eq. 7 using score matching.

As a working example, we again consider estimation of covariate-dependent Gaussian networks using neighborhood selection. Suppose the conditional distribution of Xg given Wg takes the form

$$ f^{g,*}(x|w) \propto \exp\left( -{\sum}_{j = 1}^{p} {\sum}_{k=1}^{j} \eta^{g,*}_{j,k}(w)x_{j}x_{k} - {\sum}_{j=1}^{p} {\sum}_{c=1}^{d} \theta_{j,c}^{g,*} x_{j} \phi_{c}(w) \right). $$
(8)

Then the dependencies of node j on all other nodes can be determined based on the following varying coefficient model (Hastie and Tibshirani, 1993):

$$ {X^{g}_{j}} = \eta^{g,*}_{j,0}\left( W^{g}\right) + {\sum}_{k \neq j} \eta^{g,*}_{j,k}\left( W^{g}\right) {X^{g}_{k}} + {\epsilon^{g}_{j}}. $$
(9)

The varying coefficient model is a generalization of the linear model that treats the regression coefficients as functions of the covariates. In Eq. 9, \(\eta ^{g,*}_{j,k}(w)\) returns a regression coefficient that quantifies the linear relationship between nodes j and k for subjects in group g with covariates equal to w. Then \({X^{g}_{j}}\) and \({X^{g}_{k}}\) are conditionally independent given all other nodes and given Wg = w if and only if \(\eta ^{g,*}_{j,k}(w) = 0\). The varying coefficients \(\eta ^{g,*}_{j,k}\) can thus be viewed as a conditional version of the regression coefficients in Eq. 4. (We have again abused the notation, as the varying coefficient functions in Eq. 9 are not equal to the parameters in Eq. 8, though both functions are zero for the same values of w). The intercept term \(\eta ^{g,*}_{j,0}\) accounts for the main effect of Wg on \({X^{g}_{j}}\). We can remove this main effect term by first centering the nodes \({X^{g}_{j}}\) about their conditional mean given Wg (which can be estimated by performing a linear regression of \({X^{g}_{j}}\) on ϕ(Wg)).
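The centering step described above is straightforward to carry out; the following sketch (placeholder data, base R) removes the estimated main effect of Wg from each node by taking residuals from a regression on ϕ(Wg).

```r
## Sketch: center each node about its estimated conditional mean given phi(W).
set.seed(1)
n <- 200; p <- 5; q <- 2
X <- matrix(rnorm(n * p), n, p)             # placeholder node data for one group
W <- matrix(runif(n * q, -1, 1), n, q)      # placeholder covariates
Phi <- cbind(1, W)                          # linear basis phi(W), d = q + 1
X_centered <- apply(X, 2, function(x) resid(lm(x ~ Phi - 1)))
```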

In Sections 3 and 4, we discuss construction of asymptotically normal estimators of \(\alpha ^{g,*}_{j,k}\) in the low- and high-dimensional settings using neighborhood selection and score matching. Before proceeding, we first examine the connection between the null hypotheses \(H^{0}_{j,k}\) and \(G^{0}_{j,k}\).

The Relationship between Hypotheses \(H^{0}_{j,k}\) and \(G^{0}_{j,k}\)

Hypotheses \(H^{0}_{j,k}\) in Eq. 1 and \(G^{0}_{j,k}\) in Eq. 6 are related but not equivalent. It is possible that \(H^{0}_{j,k}\) holds while \(G^{0}_{j,k}\) fails and vice versa. We provide an example below. Suppose we are using neighborhood selection to perform differential network analysis in the Gaussian setting, so we are making a comparison of linear regression coefficients between the two groups. Suppose further that the within-group networks depend on a single scalar covariate Wg, and the nodes are centered about their conditional mean given Wg. One can show that the regression coefficients \(\beta ^{g,*}_{j,k}\) are equal to the average of their conditional versions \(\eta ^{g,*}_{j,k}(W^{g})\). That is, \(\beta ^{g,*}_{j,k} = E\left [\eta ^{g,*}_{j,k}\left (W^{g}\right )\right ]\). Now, suppose \(G^{0}_{j,k}\) holds. If WI and WII do not share the same distribution (e.g., the covariate tends to take higher values in group I than in group II), the average conditional inter-node association may differ, and \(H^{0}_{j,k}\) may not hold. Although the conditional association between nodes, given the covariate, does not differ by group, the average conditional association does differ, as illustrated in Fig. 1a. In such a scenario, the difference in the average conditional association is induced by the dependence of the covariate on group membership and the dependence of the inter-node association on the covariate. Thus, inequality of \(\beta ^{\mathrm {I},*}_{j,k}\) and \(\beta ^{\text {II},*}_{j,k}\) does not necessarily capture a meaningful association between the network and group membership. Similarly, when \(H^{0}_{j,k}\) holds, it is possible that \(\eta ^{\mathrm {I},*}_{j,k} \neq \eta ^{\text {II},*}_{j,k}\). For instance, suppose that the distribution of the covariate is the same in both groups, and the average conditional association \(E\left [\eta ^{g,*}_{j,k}\left (W^{g}\right )\right ]\) is the same in both groups. If the between-node association depends more strongly upon the covariates in one group than the other, \(G^{0}_{j,k}\) will be false. This example is depicted in Fig. 1b. In this scenario, adjusting for covariates should provide improved power to detect differential connections. We note that for other distributions, it does not necessarily hold that \(\beta ^{g,*}_{j,k} = E\left [\eta ^{g,*}_{j,k}\left (W^{g}\right )\right ]\), but regardless, there is generally no equivalence between hypotheses \(H^{0}_{j,k}\) and \(G^{0}_{j,k}\).

Figure 1

Displayed are the association between nodes j and k, \(\eta ^{g}_{j,k}(\cdot )\), as a function of covariate Wg and the distribution of Wg in groups I and II. The average inter-node association is represented by the dashed darkened lines. In (a), the average inter-node association depends on group membership, though the inter-node association given the covariate does not. In (b), the average inter-node association does not depend on group membership, though the conditional association between nodes given the covariate does depend on group membership

Covariate-Adjusted Differential Network Analysis Using Neighborhood Selection

In this section, we describe in detail an approach for covariate-adjusted differential network analysis using neighborhood selection. To simplify our presentation, we focus on Gaussian graphical models, though this strategy is generally applicable to graphical models for which the node conditional distributions belong to univariate exponential family models.

Covariate Adjustment via Neighborhood Selection in Low Dimensions

We first discuss testing the unadjusted null hypothesis \(H^{0}_{j,k}\) in Eq. 1, where the \(\beta ^{g,*}_{j,k}\) are the regression coefficients in Eq. 4. Suppose, for now, that we are in the low-dimensional setting, so the number of nodes p is smaller than the sample sizes ng, g ∈{I,II}.

It is well-known that the regression coefficients can be characterized as the minimizers of the expected least squares loss — that is,

$$ \boldsymbol{\beta}^{g,*}_{j} = \underset{\boldsymbol{\beta}_{j}}{\text{arg min}} \ E\left[ \left( {X^{g}_{j}} - \beta_{j,0} - {\sum}_{k \neq j} \beta_{j,k} {X^{g}_{k}} \right)^{2} \right]. $$

One can obtain an estimate \(\hat {\boldsymbol {\beta }}_{j}^{g} = (\hat {\beta }^{g}_{j,1},\ldots ,\hat {\beta }^{g}_{j,p})\) of \(\boldsymbol {\beta }^{g,*}_{j} = (\beta ^{g,*}_{j,1},\ldots ,\beta ^{g,*}_{j,p})\) by minimizing the empirical average of the least squares loss, taking

$$ \hat{\boldsymbol{\beta}}^{g}_{j} = \underset{\boldsymbol{\beta}_{j}}{\text{arg min}} \ \frac{1}{2n^{g}} \left\| \mathbf{X}^{g}_{j} - \beta_{j,0}\mathbf{1}_{n^{g}} - {\sum}_{k \neq j} \beta_{j,k} \mathbf{X}^{g}_{k} \right\|_{2}^{2}, $$

where ∥⋅∥2 denotes the \(\ell _{2}\) norm, and \(\mathbf {1}_{n^{g}}\) denotes an \(n^{g}\)-vector of ones. The ordinary least squares estimate \(\hat {\boldsymbol {\beta }}^{g}_{j}\) is available in closed form and is easy to compute. The estimates \(\hat {\beta }^{g}_{j,k}\) are unbiased, and, under mild assumptions, are approximately normally distributed for sufficiently large ng — that is,

$$ \hat{\beta}^{g}_{j,k} \sim N\left( \beta^{g,*}_{j,k}, \tau^{g}_{j,k} \right), $$

with \(\tau ^{g}_{j,k} > 0\) (though \(\tau ^{g}_{j,k}\) can be calculated in closed form, we omit the expression for brevity).

We construct a test of \(H_{j,k}^{0}\) based on the difference between the estimates of the group-specific regression coefficients, \(\hat {\beta }^{\mathrm {I}}_{j,k} - \hat {\beta }^{\text {II}}_{j,k}\). When \(H^{0}_{j,k}\) holds, \(\hat {\beta }^{\mathrm {I}}_{j,k} - \hat {\beta }^{\text {II}}_{j,k}\) is normally distributed with mean zero and variance \(\tau ^{\mathrm {I}}_{j,k} + \tau ^{\text {II}}_{j,k}\). Given a consistent estimate \(\hat {\tau }^{g}_{j,k}\) of the variance, we can use the test statistic

$$ T_{j,k} = \frac{\left( \hat{\beta}_{j,k}^{\mathrm{I}} - \hat{\beta}^{\text{II}}_{j,k}\right)^{2}}{\hat{\tau}^{\mathrm{I}}_{j,k} + \hat{\tau}^{\text{II}}_{j,k}}, $$

which follows a chi-square distribution with one degree of freedom under the null for nI and nII sufficiently large. A p-value for \(H^{0}_{j,k}\) can be calculated as

$$ \rho_{j,k} = P\left( {\chi^{2}_{1}} > T_{j,k} \right). $$
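The sketch below (our illustration, with placeholder data) assembles this unadjusted test from two group-specific OLS fits.

```r
## Sketch: unadjusted test of H^0_{j,k} from group-specific node-wise regressions.
set.seed(1)
n <- 200; p <- 10; j <- 1; k <- 2
X_I  <- matrix(rnorm(n * p), n, p)   # placeholder data, group I
X_II <- matrix(rnorm(n * p), n, p)   # placeholder data, group II

edge_stats <- function(X, j, k) {
  others <- setdiff(seq_len(ncol(X)), j)
  fit <- lm(X[, j] ~ X[, others])
  pos <- which(others == k) + 1                    # + 1 skips the intercept
  c(beta = unname(coef(fit)[pos]), tau = unname(vcov(fit)[pos, pos]))
}

s1 <- edge_stats(X_I, j, k); s2 <- edge_stats(X_II, j, k)
T_jk  <- unname((s1["beta"] - s2["beta"])^2 / (s1["tau"] + s2["tau"]))
p_val <- pchisq(T_jk, df = 1, lower.tail = FALSE)  # p-value for H^0_{j,k}
```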

In the low-dimensional setting, performing a covariate-adjusted test is similar to performing the unadjusted test. We can obtain an estimate \(\hat {\boldsymbol {\alpha }}^{g}_{j} = \left ((\hat {\alpha }^{g}_{j,1})^{\top },\ldots ,(\hat {\alpha }^{g}_{j,p})^{\top }\right )^{\top }\) of \(\boldsymbol {\alpha }^{g,*}_{j} = \left ((\alpha ^{g,*}_{j,1})^{\top },\ldots ,(\alpha ^{g,*}_{j,p})^{\top }\right )^{\top }\) by minimizing the empirical average of the least squares loss

$$ \hat{\boldsymbol{\alpha}}^{g}_{j} = \underset{\boldsymbol{\alpha}_{j}}{\text{arg min}} \ \frac{1}{2n^{g}} {\sum}_{i=1}^{n^{g}} \left( X^{g}_{i,j} - {\sum}_{k \neq j} \left\langle \phi\left( {W^{g}_{i}}\right), \alpha_{j,k} \right\rangle X^{g}_{i,k} \right)^{2}. $$
(10)

To simplify the presentation, we introduce additional notation that allows us to rewrite Eq. 10 in a condensed form. Let \(\mathcal {V}^{g}_{k}\) be the ng × d matrix

$$ \mathcal{V}^{g}_{k} = \begin{pmatrix} X_{1,k}^{g} \times \phi\left( {W_{1}^{g}}\right) \\ \vdots \\ X_{n^{g}, k}^{g} \times \phi\left( W^{g}_{n^{g}}\right) \end{pmatrix}. $$
(11)

We can now equivalently express Eq. 10 as

$$ \hat{\boldsymbol{\alpha}}^{g}_{j} = \underset{\boldsymbol{\alpha}_{j}}{\text{arg min}} \ \frac{1}{2n^{g}} \left\| \mathbf{X}^{g}_{j} - {\sum}_{k \neq j} \mathcal{V}^{g}_{k} \alpha_{j,k} \right\|_{2}^{2}. $$
(12)

Again, \(\hat {\alpha }^{g}_{j,k}\) is unbiased and approximately normal for sufficiently large ng, satisfying

$$ \hat{\alpha}^{g}_{j,k} \sim N\left( \alpha^{g,*}_{j,k}, {{\varOmega}}^{g}_{j,k}\right), $$

where \({{\varOmega }}^{g}_{j,k}\) is a positive definite matrix of dimension d × d (though a closed form expression is available, we omit it here for brevity).

We construct a test of \(G^{0}_{j,k}\) based on \(\hat {\alpha }^{\mathrm {I}}_{j,k} - \hat {\alpha }^{\text {II}}_{j,k}\). Under the null hypothesis, \(\hat {\alpha }^{\mathrm {I}}_{j,k} - \hat {\alpha }^{\text {II}}_{j,k}\) follows a normal distribution with mean zero and variance \({{\varOmega }}^{\mathrm {I}}_{j,k} + {{\varOmega }}^{\text {II}}_{j,k}\). Given a consistent estimate \(\hat {{{\varOmega }}}^{g}_{j,k}\) of \({{\varOmega }}^{g}_{j,k}\), we can test \(G^{0}_{j,k}\) using the test statistic

$$ S_{j,k} = \left( \hat{\alpha}_{j,k}^{\mathrm{I}} - \hat{\alpha}_{j,k}^{\text{II}}\right)^{\top} \left( \hat{{{\varOmega}}}^{\mathrm{I}}_{j,k} + \hat{{{\varOmega}}}^{\text{II}}_{j,k} \right)^{-1} \left( \hat{\alpha}_{j,k}^{\mathrm{I}} - \hat{\alpha}_{j,k}^{\text{II}}\right). $$

Under the null, the test statistic follows a chi-squared distribution with d degrees of freedom, and a p-value can therefore be calculated as

$$ P\left( {\chi^{2}_{d}} > S_{j,k}\right). $$
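To make the adjusted test concrete, here is a sketch (our illustration, with placeholder data and a linear basis) that fits the varying coefficient model by OLS in each group and forms the Wald-type statistic above.

```r
## Sketch: covariate-adjusted test of G^0_{j,k} in the low-dimensional setting.
set.seed(1)
n <- 300; p <- 6; q <- 2; j <- 1; k <- 2

adjusted_block <- function(X, W, j, k) {
  Phi <- cbind(1, W)                                      # linear basis, d = ncol(W) + 1
  d   <- ncol(Phi)
  Xc  <- apply(X, 2, function(x) resid(lm(x ~ Phi - 1)))  # remove main effects of W
  others <- setdiff(seq_len(ncol(X)), j)
  V   <- do.call(cbind, lapply(others, function(m) Xc[, m] * Phi))
  fit <- lm(Xc[, j] ~ V - 1)
  cols <- (which(others == k) - 1) * d + seq_len(d)       # coefficient block for pair (j, k)
  list(alpha = coef(fit)[cols], Omega = vcov(fit)[cols, cols])
}

gI  <- adjusted_block(matrix(rnorm(n * p), n, p), matrix(runif(n * q, -1, 1), n, q), j, k)
gII <- adjusted_block(matrix(rnorm(n * p), n, p), matrix(runif(n * q, -1, 1), n, q), j, k)
delta <- gI$alpha - gII$alpha
S_jk  <- drop(t(delta) %*% solve(gI$Omega + gII$Omega) %*% delta)
p_val <- pchisq(S_jk, df = length(delta), lower.tail = FALSE)  # chi-squared with d degrees of freedom
```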

Covariate Adjustment via Neighborhood Selection in High Dimensions

The methods described in Section 3.1 are only appropriate when the number of nodes p is small relative to the sample size. The model in Eq. 9 has (p − 1)d parameters, so the least squares estimator of Section 3.1 provides stable estimates as long as nI and nII are larger than (p − 1)d. However, in the high-dimensional setting, where the number of parameters exceeds the sample size, the ordinary least squares estimates are not well-defined.

To fit the varying coefficient model Eq. 9 in the high-dimensional setting, we use a regularized estimator that relies upon an assumption of sparsity in the networks. The sparsity assumption requires that within each group only a small number of nodes are partially correlated, meaning that in Eq. 9, only a few of the vectors \(\alpha ^{g,*}_{j,k}\) are nonzero. To leverage the sparsity assumption, we propose to use the group LASSO estimator (Yuan and Lin, 2006):

$$ \tilde{\boldsymbol{\alpha}}^{g}_{j} = \underset{\boldsymbol{\alpha}_{j}}{\text{arg min}} \ \frac{1}{2n^{g}} \left\| \mathbf{X}^{g}_{j} - {\sum}_{k \neq j} \mathcal{V}^{g}_{k} \alpha_{j,k} \right\|_{2}^{2} + \lambda {\sum}_{k \neq j} \left\| \alpha_{j,k} \right\|_{2}, $$
(13)

where λ > 0 is a tuning parameter. The group LASSO provides a sparse estimate and sets some \(\tilde {\alpha }_{j,k}\) to be exactly zero, resulting in networks with few edges. The level of sparsity of \(\tilde {\boldsymbol {\alpha }}^{g}_{j}\) is determined by λ, with higher λ values forcing more \(\tilde {\alpha }_{j,k}\) to zero. We discuss selection of the tuning parameter in Section 5.1.

Though the group LASSO provides a consistent estimate of \(\boldsymbol {\alpha }^{g,*}_{j}\), the estimate is not approximately normally distributed. The group LASSO estimate of \({\alpha }^{g,*}_{j,k}\) retains a bias that diminishes at the same rate as the standard error. As a result, the group LASSO estimator has a non-standard sampling distribution that cannot be derived analytically and is therefore unsuitable for hypothesis testing.

We can obtain approximately normal estimates of \(\alpha ^{g,*}_{j,k}\) by correcting the bias of \(\tilde {\alpha }^{g}_{j,k}\), as was first proposed to obtain normal estimates for the classical \(\ell _{1}\)-penalized version of the LASSO (van de Geer et al. 2014; Zhang and Zhang, 2014). These “de-biased” or “de-sparsified” estimators have been shown to be approximately normal in moderately large samples even in the high-dimensional setting; they are therefore suitable for hypothesis testing. Our approach is to use a de-biased version of the group LASSO. Bias correction in group LASSO problems is well studied (van de Geer, 2016; Honda, 2019; Mitra and Zhang, 2016), so we are able to perform covariate-adjusted inference by applying previously-developed methods.

The bias of the group LASSO estimate can be written as

$$ \tilde{\alpha}^{g}_{j,k} - \alpha^{g,*}_{j,k} = \delta^{g}_{j,k}, $$
(14)

where \(\delta ^{g}_{j,k}\) is a nonzero d-dimensional vector (recall d is the dimension of \(\alpha ^{g,*}_{j,k}\)). Our approach is to obtain an estimate of the bias \(\tilde {\delta }_{j,k}\) and to use a de-biased estimator, defined as

$$ \check{\alpha}^{g}_{j,k} = \tilde{\alpha}^{g}_{j,k} - \tilde{\delta}^{g}_{j,k}. $$
(15)

For a suitable choice of \(\tilde {\delta }_{j,k}\), the bias-corrected estimator is approximately normal for a sufficiently large sample size ng under mild conditions, i.e.,

$$ \check{\alpha}^{g}_{j,k} \sim N\left( \alpha^{g,*}_{j,k}, {{\varOmega}}^{g}_{j,k}\right), $$
(16)

where the variance \({{\varOmega }}^{g}_{j,k}\) is a positive definite matrix, for which we obtain an estimate \(\check {{{\varOmega }}}^{g}_{j,k}\). We provide a derivation for the bias-correction and the form of our variance estimate in Appendix ??.

Similar to Section 3.1, we test the null hypothesis \(G^{0}_{j,k}\) in Eq. 6 using the test statistic

$$ S_{j,k} = \left( \check{\alpha}_{j,k}^{\mathrm{I}} - \check{\alpha}_{j,k}^{\text{II}}\right)^{\top} \left( \check{{{\varOmega}}}_{j,k}^{\mathrm{I}} + \check{{{\varOmega}}}_{j,k}^{\text{II}} \right)^{-1} \left( \check{\alpha}_{j,k}^{\mathrm{I}} - \check{\alpha}_{j,k}^{\text{II}}\right). $$
(17)

The test statistic asymptotically follows a chi-squared distribution with d degrees of freedom under the null hypothesis.

Covariate-Adjusted Differential Network Analysis Using Score Matching

In this section, we discuss covariate adjustment using the score matching framework introduced in Section 2. We first describe the score matching estimator in greater detail and then specialize the framework to estimation of pairwise exponential family graphical models in the low- and high-dimensional settings. As shown later in this section, for exponential family distributions with continuous support, the score matching loss function is a quadratic function of the parameters, providing a computationally-efficient framework for estimating graphical models.

The Score Matching Framework

We begin by providing a brief summary of the score matching framework (Hyvärinen, 2005; 2007). Let Z be a random vector taking values in a sample space \(\mathcal {Z} \subseteq \mathbb {R}^{p}\), generated from a distribution with density function \(h^{*}\). For any candidate density h, we denote the gradient and Laplacian of the log-density by

$$ \nabla \log h(z) = \left( \frac{\partial \log h(z)}{\partial z_{1}}, \ldots, \frac{\partial \log h(z)}{\partial z_{p}} \right)^{\top}, \qquad {{\varDelta}} \log h(z) = {\sum}_{j=1}^{p} \frac{\partial^{2} \log h(z)}{\partial {z_{j}^{2}}}. $$

The score matching loss L is defined as a measure of divergence between a candidate density function h and the true density \(h^{*}\):

$$ L(h) = \int \left\| \nabla \log h(z) - \nabla \log h^{*}(z) \right\|_{2}^{2} h^{*}(z) dz. $$
(18)

It is apparent that the score matching loss is minimized when \(h = h^{*}\). A natural approach to constructing an estimator for \(h^{*}\) would then be to minimize the empirical score matching loss given observations Z1,…,Zn, defined as

$$ L_{n}(h) = \frac{1}{n} {\sum}_{i=1}^{n} \left \| \nabla \log h\left( Z_{i}\right) - \nabla \log h^{*}\left( Z_{i}\right) \right \|_{2}^{2}. $$

Because the score matching loss function takes as input the gradient of the log density function, the loss does not depend on the normalizing constant. This makes score matching appealing when the normalizing constant is intractable.

The empirical loss seemingly depends on prior knowledge of \(h^{*}\). However, if \(h^{*}(z)\) and \(\left \| \nabla h^{*}(z) \right \|_{2}\) both tend to zero as z approaches the boundary of \(\mathcal {Z}\), a partial integration argument can be used to show that the score matching loss can be expressed as

$$ L(h) = \int \left\{ {{\varDelta}} \log h(z) + \frac{1}{2}\left \| \nabla \log h(z) \right \|_{2}^{2} \right\} h^{*}(z)dz + \text{const.}, $$
(19)

where ‘const.’ is a term that does not depend on h. We can therefore estimate \(h^{*}\) by minimizing an empirical version of the score matching loss that does not depend on \(h^{*}\). We can express the empirical loss as

$$ L_{n}(h) = \frac{1}{n} {\sum}_{i=1}^{n} \left\{ {{\varDelta}} \log h(Z_{i}) + \frac{1}{2}\left \| \nabla \log h(Z_{i}) \right \|_{2}^{2} \right\}. $$
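As a concrete illustration (our example: a mean-centered Gaussian density \(h(z) \propto \exp \left (-\frac {1}{2} z^{\top } K z\right )\) with symmetric positive definite K, a reparametrization of Eq. 3), we have \(\nabla \log h(z) = -Kz\) and \({{\varDelta }} \log h(z) = -\text {tr}(K)\), so the empirical loss reduces to

$$ L_{n}(h) = -\text{tr}(K) + \frac{1}{2n} {\sum}_{i=1}^{n} Z_{i}^{\top} K^{2} Z_{i}, $$

which is a convex quadratic function of the entries of K, so its minimizer can be obtained by solving a system of linear equations.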

The score matching loss is particularly appealing for exponential family distributions with continuous support, as it leads to a quadratic optimization function (Lin et al. 2016). However, when Z is non-negative, the arguments used to express Eq. 18 as Eq. 19 fail because \(h^{*}(z)\) and \(\left \|\nabla h^{*}(z)\right \|_{2}\) do not approach zero at the boundary. We can overcome this problem by instead considering the generalized score matching framework (Yu et al. 2019; Hyvärinen, 2007) as an extension that is suitable for non-negative data. Let \(v_{1}, \ldots , v_{p}\) be positive and differentiable functions, let \(v(z) = \left (v_{1}(z_{1}),\ldots ,v_{p}(z_{p})\right )^{\top }\), let \(\dot {v}_{j}\) denote the derivative of vj, and let ∘ denote the element-wise product operator. The generalized score matching loss is defined as

$$ L(h) = \int \left\|\left\{\nabla \log h(z) - \nabla \log h^{*}(z) \right\} \circ v^{1/2}(z) \right\|_{2}^{2} h^{*}(z) dz, $$
(20)

and is also minimized when h = h. As for the original score matching loss in Eq. 18, the generalized score matching loss seemingly depends on prior knowledge of h. However, under mild technical conditions on h and v (see Appendix ??), the loss in Eq. 20 can be rewritten as

$$ L(h) = \int\bigg[ {\sum}_{j=1}^{p} \dot{v}_{j}(z_{j})\left\{\frac{\partial \log h(z)}{\partial z_{j}} \right\} + v_{j}(z_{j})\left\{\frac{\partial^{2} \log h(z)}{\partial {z^{2}_{j}}} \right\} + \frac{1}{2}v_{j}(z_{j})\left\{\frac{\partial \log h(z)}{\partial z_{j}} \right\}^{2} \bigg]h^{*}(z)dz. $$
(21)

The generalized score matching loss thus no longer depends on \(h^{*}\), and an estimator can be constructed by minimizing the empirical version of Eq. 21 with respect to h. To this end, the original generalized score matching estimator considered \(v_{j}(z_{j}) = {z_{j}^{2}}\) (Hyvärinen, 2007). In this case, it becomes necessary to estimate higher-order moments, leading to poor performance of the estimator. It has been shown that by instead taking v as a slowly increasing function, such as \(v_{j}(z_{j}) = \log (1 + z_{j})\), one obtains improved theoretical results and better empirical performance (Yu et al. 2019).

Covariate Adjustment in High-Dimensional Exponential Family Models via Score Matching

In this sub-section, we discuss construction of asymptotically normal estimators for the parameters of the exponential family pairwise interaction model Eq. 7 using the generalized score matching framework. To simplify our presentation, we consider the setting in which we are only interested in studying the connectedness between one node \({X^{g}_{j}}\) and all other neighboring nodes in the network. To this end, it suffices to estimate the conditional density of \({X^{g}_{j}}\) given all other nodes and the covariates Wg. A similar approach to the one we describe below can also be used to estimate the entire joint density in Eq. 7. For simplicity, we assume that in Eq. 7, there exist functions ψ and ζ such that ψ = ψj,k for all (j,k) and ζj,c = ζ for all (j,c), and that μj = 0. For x = (x1,…,xp) and w = (w1,…,wq) the conditional density can thus be expressed as

$$ f_{j}^{g,*}(x_{j}|x_{1},\ldots,x_{p},w) \propto \exp\left( {\sum}_{k = 1}^{p} \left\langle \alpha^{g,*}_{j,k}, \phi\left( w\right) \right\rangle\psi(x_{j}, x_{k}) + {\sum}_{c=1}^{d} \theta_{j,c}^{g,*} \zeta\left( x_{j}, \phi_{c}(w)\right)\right), $$
(22)

where the proportionality is up to a normalizing constant that does not depend on xj.

We first explicitly define the score matching loss for the conditional density function in Eq. 22. Let \(\boldsymbol {\alpha }^{g,*}_{j} = \left ((\alpha ^{g,*}_{j,1})^{\top }, {\ldots } ,(\alpha ^{g,*}_{j,p})^{\top }\right )^{\top }\), and similarly let \(\boldsymbol {\theta }^{g,*}_{j} = (\theta ^{g,*}_{j,1}, {\ldots } ,\theta ^{g,*}_{j,d})^{\top }\). Let \(\dot {\psi }\) and \(\ddot {\psi }\) denote the first and second derivatives of ψ with respect to xj, and similarly, let \(\dot {\zeta }\) and \(\ddot {\zeta }\) denote the first and second derivatives of ζ with respect to xj. We define a non-negative, differentiable function \(v_{j}\), and let \(\dot {v}_{j}\) denote the first derivative of vj. Then for candidate parameters \(\boldsymbol {\alpha }_{j} = \left (\alpha ^{\top }_{j,1},\ldots ,\alpha ^{\top }_{j,p}\right )^{\top }\) and 𝜃j = (𝜃j,1,…,𝜃j,d), the empirical generalized score matching loss for the conditional density of \({X_{j}^{g}}\) given all other nodes and the covariates can be expressed as

$$ \begin{array}{@{}rcl@{}} L^{g}_{n,j}(\boldsymbol{\alpha}, \boldsymbol{\theta}) = \frac{1}{2n^{g}}&&{\sum}_{i=1}^{n^{g}} v_{j}\left( X_{i,j}^{g}\right)\bigg\{ {\sum}_{k = 1}^{p} \left\langle \alpha_{j,k}, \phi\left( {W^{g}_{i}}\right) \right\rangle\dot{\psi}\left( X^{g}_{i,j}, X^{g}_{i,k}\right) + {\sum}_{c=1}^{d} \theta_{j,c} \dot{\zeta}\left( X^{g}_{i,j}, \phi_{c}\left( {W^{g}_{i}}\right)\right) \bigg\}^{2} + \\ \frac{1}{n^{g}}&&{\sum}_{i=1}^{n^{g}} v_{j}\left( X^{g}_{i,j}\right)\bigg\{ {\sum}_{k = 1}^{p} \left\langle \alpha_{j,k}, \phi\left( {W^{g}_{i}}\right) \right\rangle\ddot{\psi}\left( X^{g}_{i,j}, X^{g}_{i,k}\right) + {\sum}_{c=1}^{d} \theta_{j,c} \ddot{\zeta}\left( X^{g}_{i,j}, \phi_{c}\left( {W^{g}_{i}}\right)\right) \bigg\} + \\ \frac{1}{n^{g}}&&{\sum}_{i=1}^{n^{g}} \dot{v}_{j}\left( X^{g}_{i,j}\right)\bigg\{ {\sum}_{k = 1}^{p} \left\langle \alpha_{j,k}, \phi\left( {W^{g}_{i}}\right) \right\rangle\dot{\psi}\left( X^{g}_{i,j}, X^{g}_{i,k}\right) + {\sum}_{c=1}^{d} \theta_{j,c} \dot{\zeta}\left( X^{g}_{i,j}, \phi_{c}\left( {W^{g}_{i}}\right)\right) \bigg\}.\\ \end{array} $$
(23)

The true parameters \(\boldsymbol {\alpha }^{g,*}_{j}\) and \(\boldsymbol {\theta }^{g,*}_{j}\) can be characterized as the minimizers of the population score matching loss, as discussed in Section 4.1.

The loss function in Eq. 23 is quadratic in parameters \(\boldsymbol {\alpha }^{g}_{j}\) and \(\boldsymbol {\theta }_{j}^{g}\) and can thus be solved efficiently. When the sample size ng is much larger than the number of unknown parameters (p + 1)d, one can estimate \(\boldsymbol {\alpha }^{g,*}_{j}\) and \(\boldsymbol {\theta }_{j}^{g,*}\) by simply minimizing \(L^{g}_{n,j}\) with respect to the unknown parameters. Moreover, we can readily establish asymptotic normality of the parameter estimates using results from classical M-estimation theory (van der Vaart, 2000). To avoid including cumbersome notation, we reserve the details for Appendix ??.

When the sample size is smaller than the number of parameters, the minimizer of \(L^{g}_{n,j}\) is no longer well-defined. Similar to Section 3.2, we use regularization to obtain a consistent estimator in the high-dimensional setting. We define the \(\ell _{2}\)-regularized generalized score matching estimator as

$$ \left( \tilde{\boldsymbol{\alpha}}^{g}_{j}, \tilde{\boldsymbol{\theta}}^{g}_{j} \right) = \underset{\boldsymbol{\alpha}_{j}, \boldsymbol{\theta}_{j}}{\text{arg min}} L^{g}_{n,j}(\boldsymbol{\alpha}_{j}, \boldsymbol{\theta}_{j}) + \lambda {\sum}_{k=1}^{p} \left\| \alpha_{j,k} \right\|_{2}, $$
(24)

where λ > 0 is a tuning parameter. Similar to the group LASSO estimator in Eq. 13, the regularization term in Eq. 24 induces sparsity in the estimate \(\tilde {\boldsymbol {\alpha }}_{j}^{g}\) and sets some \(\tilde {\alpha }^{g}_{j,k}\) to be exactly zero. The tuning parameter controls the level of sparsity, where more vectors \(\tilde {\alpha }^{g}_{j,k}\) are zero for higher λ. In Appendix ??, we establish consistency of the regularized score matching estimator assuming sparsity of \(\boldsymbol {\alpha }^{g,*}_{j}\) and some additional regularity conditions.

As is the case for the group LASSO estimator, the regularized score matching estimator has an intractable limiting distribution because its bias and standard error diminish at the same rate. We can obtain an asymptotically normal estimate by subtracting from the initial estimate an estimate of the bias. In Appendix ??, we construct such a bias-corrected estimate \(\check {\alpha }^{g}_{j,k}\) that, for sufficiently large ng, satisfies

$$ \check{\alpha}^{g}_{j,k} \sim N\left( \alpha^{g,*}_{j,k}, {{\varOmega}}^{g}_{j,k} \right), $$

for a positive definite matrix \({{\varOmega }}^{g}_{j,k}\). Given bias-corrected estimates and a consistent estimate \(\check {{{\varOmega }}}^{g}_{j,k}\) of \({{\varOmega }}^{g}_{j,k}\), we can test the null hypothesis in Eq. 6 using the test statistic

$$ S_{j,k} = \left( \check{\alpha}^{\mathrm{I}}_{j,k} - \check{\alpha}^{\text{II}}_{j,k} \right)^{\top} \left( \check{{{\varOmega}}}^{\mathrm{I}}_{j,k} + \check{{{\varOmega}}}^{\text{II}}_{j,k} \right)^{-1}\left( \check{\alpha}^{\mathrm{I}}_{j,k} - \check{\alpha}^{\text{II}}_{j,k} \right). $$

Under the null hypothesis, the test statistic follows a chi-squared distribution with d degrees of freedom.

Numerical Studies

In this section, we examine the performance of our proposed test in a simulation study. We consider the neighborhood selection approach described in Section 3. Our simulation study has three objectives: (1) to assess the stability of our estimators for the covariate-dependent networks, (2) to examine the effect of sample size on statistical power and type-I error control, and (3) to illustrate that failing to adjust for covariates can in some settings result in poor type-I error control or reduced statistical power.

Implementation

We first discuss implementation of the neighborhood selection approach. The group LASSO estimate in Eq. 13 does not exist in closed form, in contrast to the ordinary least squares estimate in Eq. 12. To solve Eq. 13, we use the efficient algorithm implemented in the publicly available R package gglasso (Yang and Zou, 2015).

The group LASSO estimator requires selection of a tuning parameter λ, which controls the sparsity of the estimate. We select the tuning parameter by performing K-fold cross-validation, using K = 10 folds. Since the selection of λ is sensitive to the scale of the columns of \(\mathcal {V}_{k}^{g}\) in Eq. 11, we scale the columns by their standard deviations prior to cross-validating. After fitting the group LASSO with the selected tuning parameter, we convert the estimates back to their original scale by dividing the estimates by the standard deviations of the columns of \(\mathcal {V}_{k}^{g}\).
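The sketch below illustrates this fitting procedure for a single node (placeholder data; the grouping and scaling conventions are ours, and the authors' implementation may differ in its exact calls to gglasso).

```r
## Sketch: group LASSO fit for node j with 10-fold CV and column scaling.
library(gglasso)
set.seed(1)
n <- 100; p <- 40; q <- 2; j <- p
X <- matrix(rnorm(n * p), n, p)                  # placeholder node data for one group
W <- matrix(runif(n * q, -1, 1), n, q)
Phi <- cbind(1, W)                               # linear basis, d = 3
d <- ncol(Phi)
others <- setdiff(seq_len(p), j)
V <- do.call(cbind, lapply(others, function(k) X[, k] * Phi))
grp <- rep(seq_along(others), each = d)          # one group of d columns per node k

sds <- apply(V, 2, sd)                           # scale columns before cross-validating
Vs  <- sweep(V, 2, sds, "/")
cv  <- cv.gglasso(x = Vs, y = X[, j], group = grp, pred.loss = "L2", nfolds = 10)
fit <- gglasso(x = Vs, y = X[, j], group = grp, lambda = cv$lambda.min)
alpha_tilde <- coef(fit)[-1] / sds               # drop intercept; rescale to the original columns
```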

Simulation Setting

In what follows, we describe our simulation setting. In short, we generate data from the varying coefficient model in Eq. 9, where we treat nodes 1 through (p − 1) as predictors, and treat node p as the response. We first randomly generate data for nodes 1 through (p − 1) in groups I and II from the same multivariate normal distribution. We then construct \(\eta ^{g,*}_{j,k}\) and generate data for two covariates \({W_{i}^{g}} = (W_{i,1}^{g}, W_{i,2}^{g})^{\top }\) so that one covariate acts as a confounding variable, and the other covariate should improve statistical power to detect differential associations after adjustment.

To simulate data for nodes 1 through (p − 1), we first generate a random graph with (p − 1) nodes and an edge density of .05 from a power law distribution with power parameter 5 (Newman, 2003). Denoting the edge set of the graph by E, we generate the (p − 1) × (p − 1) matrix Θ as

$$ {{\varTheta}}_{j,k} = \begin{cases} 0 & (j,k) \notin E \\ .5 & (j,k) \in E \text{ with 50\% probability} \\ -.5 & (j,k) \in E \text{ with 50\% probability} \end{cases}, $$

with Θj,k = Θk,j. Letting a denote the smallest eigenvalue of Θ, we set Σ = (Θ − (a− .1)I)− 1, where I is the identity matrix. We then draw \((X_{i,1}^{g},\ldots ,X_{i,p-1}^{g})^{\top }\) from a multivariate normal distribution with mean zero and covariance Σ for i = 1,…,ng for each group g.
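A sketch of this construction is given below (our illustration; for simplicity, an Erdős–Rényi random graph with edge density .05 stands in for the power-law graph described above, and MASS supplies the multivariate normal sampler).

```r
## Sketch: build Theta from a random edge set, form Sigma, and draw nodes 1, ..., p - 1.
library(MASS)
set.seed(1)
p <- 40; n <- 160
m <- p - 1
A <- matrix(0, m, m)
A[upper.tri(A)] <- rbinom(m * (m - 1) / 2, 1, 0.05)        # random edge indicators (stand-in graph)
Theta <- A * sample(c(-0.5, 0.5), m * m, replace = TRUE)   # +/- .5 entries on the selected edges
Theta <- Theta + t(Theta)                                  # symmetric with zero diagonal
a <- min(eigen(Theta, symmetric = TRUE)$values)
Sigma <- solve(Theta - (a - 0.1) * diag(m))                # shift so the inverted matrix is positive definite
X_pred <- mvrnorm(n, mu = rep(0, m), Sigma = Sigma)        # nodes 1, ..., p - 1 for one group
```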

We generate \(W^{\mathrm {I}}_{i,1}\) from a Beta(3/2,1) distribution and \(W^{\text {II}}_{i,1}\) from a Beta(1,3/2) distribution. We center and scale both \(W^{\mathrm {I}}_{i,1}\) and \(W^{\text {II}}_{i,1}\) to the (− 1,1) interval. We generate \(W^{\mathrm {I}}_{i,2}\) and \(W_{i,2}^{\text {II}}\) each from a Uniform(− 1,1) distribution.

We consider two different choices for the varying coefficient functions \(\eta ^{g,*}_{j,k}\):

  • Linear Polynomial:

    $$ \begin{array}{@{}rcl@{}} \eta^{\mathrm{I,*}}_{p,1}(w_{1}, w_{2}) = .5 + .5w_{1}; &&\eta^{\mathrm{II,*}}_{p,1}(w_{1}, w_{2}) = .5 + .5w_{1} \\ \eta^{\mathrm{I,*}}_{p,2}(w_{1}, w_{2}) = .5 + .25w_{2}; &&\eta^{\mathrm{II,*}}_{p,2}(w_{1}, w_{2}) = .5 + .75w_{2} \\ \eta^{\mathrm{I,*}}_{p,3}(w_{1}, w_{2}) = 0; &&\eta^{\mathrm{II,*}}_{p,3}(w_{1}, w_{2}) = .5, \end{array} $$

    and \(\eta ^{g,*}_{p,k} = 0\) for k ≥ 4.

  • Cubic Polynomial:

    $$ \begin{array}{@{}rcl@{}} \eta^{\mathrm{I,*}}_{p,1}(w_{1}, w_{2}) = .5 + .5\left( w_{1} + {w_{1}^{2}} + {w_{1}^{3}}\right); &&\eta^{\mathrm{II,*}}_{p,1}(w_{1}, w_{2}) = .5 + .5\left( w_{1} + {w_{1}^{2}} + {w_{1}^{3}}\right) \\ \eta^{\mathrm{I,*}}_{p,2}(w_{1}, w_{2}) = .5 + .25\left( w_{2} + {w_{2}^{3}}\right); &&\eta^{\mathrm{II,*}}_{p,2}(w_{1}, w_{2}) = .5 + .75\left( w_{2} + {w_{2}^{3}}\right) \\ \eta^{\mathrm{I,*}}_{p,3}(w_{1}, w_{2}) = 0; &&\eta^{\mathrm{II,*}}_{p,3}(w_{1}, w_{2}) = .5, \end{array} $$

    and \(\eta ^{g,*}_{p,k} = 0\) for k ≥ 4.

The first covariate \(W_{i,1}^{g}\) confounds the association between nodes p and 1. The distribution of \(W_{i,1}^{g}\) depends on group membership, and \(W_{i,1}^{g}\) affects the association between nodes p and 1. However, \(\eta ^{\mathrm {I},*}_{p,1}(w) = \eta ^{\text {II},*}_{p,1}(w)\) for all w. Thus, \(G^{0}_{p,1}\) in Eq. 6 holds while \(H^{0}_{p,1}\) in Eq. 1 fails, as depicted in Fig. 1a. Failing to adjust for \({W_{1}^{g}}\) should therefore result in an inflated type-I error rate for the hypothesis \(G^{0}_{p,1}\). Adjusting for the second covariate \(W_{i,2}^{g}\) should improve the power to detect the differential connection between nodes p and 2. We have constructed \(\eta ^{g,*}_{p,2}\) so that \(E\left [\eta ^{\mathrm {I},*}_{p,2}\left (W^{\mathrm {I}}\right )\right ] = E\left [\eta ^{\text {II},*}_{p,2}\left (W^{\text {II}}\right )\right ]\), though the association between nodes p and 2 depends more strongly on Wg in group II than in group I. Thus, \(H^{0}_{p,2}\) holds while \(G^{0}_{p,2}\) fails, as depicted in Fig. 1b. The association between nodes p and 3 does not depend on either covariate, though the association differs by group. Thus, one should be able to identify a differential connection using either the adjusted or unadjusted test. Node p is conditionally independent of all remaining nodes \(k \geq 4\) in both groups.

For i = 1,…,ng, we generate \(X^{g}_{i,p}\) as

$$ X^{g}_{i,p} ={\sum}_{k \neq p} \eta^{g,*}_{p,k}\left( {W_{i}^{g}}\right) X^{g}_{i,k} + {\epsilon^{g}_{i}}, $$

where \({\epsilon ^{g}_{i}}\) follows a normal distribution with zero mean and unit variance. We use balanced sample sizes nI = nII = n and consider n ∈{80,160,240}. We set the number of nodes p = 40. The graph for nodes 1 through (p − 1) contains 15 edges. Leaving Σ fixed, we generate 400 random data sets following the above approach.
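Continuing the sketch (linear-polynomial setting, group II; X_pred is any n × (p − 1) matrix of predictor nodes, e.g., from the previous sketch), the covariates and node p can be generated as follows.

```r
## Sketch: generate covariates and node p for group II under the linear-polynomial setting.
set.seed(2)
n <- 160; p <- 40
X_pred <- matrix(rnorm(n * (p - 1)), n, p - 1)   # placeholder for nodes 1, ..., p - 1
W1 <- 2 * rbeta(n, 1, 3/2) - 1                   # covariate 1, Beta(1, 3/2) rescaled to (-1, 1)
W2 <- runif(n, -1, 1)                            # covariate 2
eta <- matrix(0, n, p - 1)                       # varying coefficients eta^{II,*}_{p,k}(W_i)
eta[, 1] <- 0.5 + 0.5 * W1
eta[, 2] <- 0.5 + 0.75 * W2
eta[, 3] <- 0.5
X_p <- rowSums(eta * X_pred) + rnorm(n)          # node p with standard normal errors
```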

We consider two choices of the basis expansion ϕ:

  1. Linear basis: \(\phi (w_{1}, w_{2}) = \begin {pmatrix} 1 & w_{1} & w_{2} \end {pmatrix}^{\top }\);

  2. Cubic polynomial basis: \(\phi (w_{1}, w_{2}) = \begin {pmatrix} 1 & w_{1} & {w_{1}^{2}} & {w_{1}^{3}} & w_{2} & {w_{2}^{2}} & {w_{2}^{3}} \end {pmatrix}^{\top }\).

Using a linear basis, d = 3, and the model in Eq. 9 has 117 parameters. With the cubic polynomial basis, d = 7, and there are 273 parameters.

We compare our proposed methodology with the approach for differential network analysis without covariate adjustment described in Section 3.1. In the unadjusted analysis, ordinary least squares estimation is justified because although (p − 1)d is large with respect to n, (p − 1) is smaller than n.

Simulation Results

Figure 2 shows the Monte Carlo estimates of the expected \(\ell _{2}\) error for the de-biased group LASSO estimates \(\check {\alpha }^{g}_{p,k}\), \(E\left \|\check {\alpha }^{g}_{p,k} - \alpha ^{g,*}_{p,k}\right \|_{2}\), for k = 1,…,(p − 1). We only report the \(\ell _{2}\) error when the basis ϕ is correctly specified for the varying coefficient function \(\eta ^{g,*}_{p,k}\) — that is, when ϕ is a linear basis, and \(\eta ^{g,*}_{p,k}\) is a linear function or when ϕ is a cubic basis, and \(\eta ^{g,*}_{p,k}\) is a cubic function. In both the linear and cubic polynomial settings, the average \(\ell _{2}\) estimation error for \(\alpha ^{g,*}_{p,k}\) decreases with the sample size for all k, as expected. We also find that in small samples, the estimation error is substantially lower in the linear setting than in the cubic setting. This suggests that estimates are less stable in more complex models.

Figure 2

Monte Carlo estimates of expected \(\ell _{2}\) error, \(E\left \|\check {\alpha }^{g}_{p,k} - \alpha ^{g,*}_{p,k}\right \|_{2}\), for k = 1,…,39. The linear polynomial plots display the \(\ell _{2}\) error when \(\eta ^{g,*}_{j,k}\) is a linear function, and ϕ is a linear basis. The cubic polynomial plots display the \(\ell _{2}\) error when \(\eta ^{g,*}_{j,k}\) is a cubic polynomial, and ϕ is a cubic basis

In Table 1, we report Monte Carlo estimates of the probability of rejecting \(G^{0}_{p,k}\), the null hypothesis that nodes p and k are not differentially connected given Wg, for k = 1, k = 2, k = 3, and k ≥ 4, using both the adjusted and unadjusted tests at the significance level κ = .05. As the purpose of the simulation study is to examine the behavior of the edge-wise test, we do not perform a multiple testing correction.

Table 1 Monte Carlo estimates of probability of rejecting \(G^{0}_{p,k}\), the null hypothesis that nodes p and k are not differentially connected, given Wg

For k = 1 (i.e., when \(H^{0}_{p,k}\) fails, but \(G^{0}_{p,k}\) holds), the unadjusted test is anti-conservative, and the probability of falsely rejecting \(G^{0}_{p,k}\) increases with the sample size. When an adjusted test is performed using a linear basis, and when \(\eta ^{g,*}_{p,1}\) is linear, the type-I error rate is slightly inflated but appears to approach the nominal level of .05 as the sample size increases. However, when \(\eta ^{g,*}_{p,1}\) is a cubic function, and the linear basis is mis-specified, the type-I error rate is inflated, though it is still slightly lower than that of the unadjusted test. For both specifications of \(\eta ^{g,*}_{p,1}\), the covariate-adjusted test controls the type-I error rate near the nominal level when a cubic polynomial basis is used. For k = 2 (i.e., when \(H^{0}_{p,k}\) holds, but \(G^{0}_{p,k}\) fails), the unadjusted test exhibits low power to detect differential associations. The adjusted test provides greatly improved power when either a linear or cubic basis is used. For k = 3 (i.e., when both \(H^{0}_{p,k}\) and \(G^{0}_{p,k}\) fail), the unadjusted test and both adjusted tests are well-powered against the null. For k ≥ 4 (i.e., when nodes p and k are conditionally independent in both groups), the unadjusted test and the adjusted test with a linear basis both control the type-I error near the nominal level. However, the covariate-adjusted test is conservative when a cubic basis is used.

The simulation results corroborate our expectations and suggest that there are potential benefits to covariate adjustment. We find that when the sample size is large, the covariate-adjusted test behaves reasonably well with either choice of basis function. However, in small samples, the covariate-adjusted test is somewhat imprecise, and the type-I error rate can be slightly above or below the nominal level. Practitioners should therefore exercise caution when using our proposed methodology in very small samples.

Data Example

Breast cancer classification based on expression of the estrogen receptor (ER) is prognostic of clinical outcomes. Breast cancers can be classified as estrogen receptor positive (ER+) or estrogen receptor negative (ER-), with approximately 70% of breast cancers being ER+ (Lumachi et al. 2013). In ER+ breast cancer, the cancer cells require estrogen to grow; this has been shown to be associated with positive clinical outcomes, compared with ER- breast cancer (Carey et al. 2006). Identifying differences between the biological pathways of ER+ and ER- breast cancers can be helpful for understanding the underlying disease mechanisms.

It has been shown that age is associated with ER status and that age can be associated with gene expression (Khan et al. 1998; Yang et al. 2015). This warrants consideration of age as an adjustment variable in a comparison of gene co-expression networks between ER groups.

We perform an age-adjusted differential analysis of the ER+ and ER- breast cancer networks, using publicly available data from The Cancer Genome Atlas (TCGA) (Weinstein et al. 2013). We obtain clinical measurements and gene expression data from a total of 806 ER+ patients and 237 ER- patients. We consider the set of p = 145 genes in the Kyoto Encyclopedia of Genes and Genomes (KEGG) breast cancer pathway (Kanehisa and Goto, 2000), and adjust for age as our only covariate. The average age in the ER+ group is 59.3 years (SD = 13.3), and the average age in the ER- group is 55.9 years (SD = 12.4). We use a linear basis for covariate adjustment. In the ER+ group, the sample size is considerably larger than the number of parameters, so we can fit the varying coefficient model in Eq. 9 using ordinary least squares. We use the de-biased group LASSO to estimate the network for the ER- group because the sample size is smaller than the number of model parameters. We compare the results from the covariate-adjusted analysis with the unadjusted approach described in Section 3.1.

To assess for differential connectivity between any two nodes j and k, we can either treat node j or node k as the response in the varying coefficient model in Eq. 9. We can then test either of the hypotheses \(G^{0}_{j,k}:\alpha ^{\mathrm {I},*}_{j,k} = \alpha ^{\text {II},*}_{j,k}\) or \(G^{0}_{k,j}:\alpha ^{\mathrm {I},*}_{k,j} = \alpha ^{\text {II},*}_{k,j}\). Our approach is to set our p-value for the test for differential connectivity between nodes j and k as the minimum of the p-values for the tests of \(G^{0}_{j,k}\) and \(G^{0}_{k,j}\), though we acknowledge that this strategy is anti-conservative.

Our objective is to identify all pairs of differentially connected genes, so we need to adjust for the fact that we perform a separate hypothesis test for each gene pair. We account for multiplicity by controlling the false discovery rate at the level κ = .05 using the Benjamini-Yekutieli method (2001).
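
For illustration, the following Python sketch carries out this step: it symmetrizes a matrix of edge-wise p-values by taking the minimum over the two possible response nodes, as described above, and then applies the Benjamini-Yekutieli adjustment over all gene pairs using statsmodels. The p-value matrix here is a random placeholder for the output of the covariate-adjusted tests, and the function name is illustrative.

import numpy as np
from statsmodels.stats.multitest import multipletests

def differential_edges(pval_mat, alpha=0.05):
    """pval_mat[j, k] holds the p-value for the test of G^0_{j,k} (node j
    treated as the response).  Take the minimum over the two directions for
    each pair, then control the FDR with the Benjamini-Yekutieli method."""
    rows, cols = np.triu_indices(pval_mat.shape[0], k=1)
    edge_p = np.minimum(pval_mat, pval_mat.T)[rows, cols]
    reject, _, _, _ = multipletests(edge_p, alpha=alpha, method="fdr_by")
    return list(zip(rows[reject], cols[reject]))

# Placeholder p-values for the 145-gene KEGG breast cancer pathway:
rng = np.random.default_rng(1)
P = rng.uniform(size=(145, 145))
selected_edges = differential_edges(P, alpha=0.05)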

The differential networks obtained from the unadjusted and adjusted analyses are substantially different. We report 106 differentially connected edges from the adjusted analysis (shown in Fig. 3), compared to only two such edges from the unadjusted analysis. This suggests that the relationship between the gene co-expression network and age may differ by ER group.

Figure 3

Differential breast cancer network by estrogen receptor status from covariate-adjusted analysis. Nodes with at least five differentially connected neighbors are circled. The false discovery rate is controlled at .05

Discussion

In this paper, we have addressed challenges that arise when performing differential network analysis (Shojaie, 2020) in the setting where the network depends on covariates. Using both synthetic and real data, we showed that accounting for covariates can result in better control of type-I error and improved power.

We propose a parsimonious approach for covariate adjustment in differential network analysis. A number of improvements and extensions can be made to our current work. First, while this paper focuses on differential network analysis in exponential family models, our framework can be applied to other models where conditional dependence between any pair of nodes can be represented by a single scalar parameter. This includes semi-parametric models such as the nonparanormal model (Liu et al. 2009), as well as distributions defined over complex domains, which can be modeled using the generalized score matching framework (Yu et al. 2021). Additionally, we only discuss testing edge-wise differences between the networks, though testing differences between sub-networks may also be of interest. When the sub-networks are low-dimensional, one can construct a chi-squared test using similar test statistics as presented in Sections 3 and 4 because joint asymptotic normality of a low-dimensional set of the estimators \(\check {\alpha }^{g}_{j,k}\) can be readily established. Such an approach is not applicable to high-dimensional sub-networks, but it may be possible to construct a calibrated test using recent results on simultaneous inference in high-dimensional models (Zhang and Cheng, 2017; Yu et al. 2020). We can also improve the statistical efficiency of the network estimates by considering joint estimation procedures that borrow information across groups (Guo et al. 2011; Danaher et al. 2014; Saegusa and Shojaie, 2016). Finally, we assume that the relationship between the network and the covariates can be represented by a low-dimensional basis expansion. Investigating nonparametric approaches that relax this assumption can be a fruitful area of research.

Data Availability

The findings of this paper are supported by data from The Cancer Genome Atlas, which are accessible using the publicly available R package RTCGA.

Code availability

An implementation of the proposed methodology is available at https://github.com/awhudson/CovDNA.

References

  • Barabási, A.L., Gulbahce, N. and Loscalzo, J. (2011). Network medicine: A network-based approach to human disease. Nat. Rev. Genet. 12, 56–68.

  • Belilovsky, E., Varoquaux, G. and Blaschko, M.B. (2016). Testing for differences in Gaussian graphical models: Applications to brain connectivity. In: Advances in neural information processing systems, vol. 29. Curran Associates Inc., New York.

  • Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals Stat. 1165–1188.

  • Breheny, P. and Huang, J. (2009). Penalized methods for bi-level variable selection. Stat. Interf. 2, 369.

  • Bühlmann, P. and van de Geer, S. (2011). Statistics for high-dimensional data: Methods, theory and applications. Springer Science & Business Media, Berlin.

  • Carey, L.A., Perou, C.M., Livasy, C.A., Dressler, L.G., Cowan, D., Conway, K., Karaca, G., Troester, M.A., Tse, C.K., Edmiston, S. et al. (2006). Race, breast cancer subtypes, and survival in the Carolina Breast Cancer Study. J. Am. Med. Assoc. 295, 2492–2502.

  • Chen, S., Witten, D.M. and Shojaie, A. (2015). Selection and estimation for mixed graphical models. Biometrika 102, 47–64.

  • Danaher, P., Wang, P. and Witten, D.M. (2014). The joint graphical lasso for inverse covariance estimation across multiple classes. J. R. Stat. Soc. Series B 76, 373–397.

  • Friedman, J., Hastie, T. and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441.

  • de la Fuente, A. (2010). From ‘differential expression’ to ‘differential networking’–identification of dysfunctional regulatory networks in diseases. Trends Genet. 26, 326–333.

  • van de Geer, S. (2016). Estimation and testing under sparsity. Lect. Notes Math. 2159.

  • van de Geer, S., Bühlmann, P., Ritov, Y. and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Stat. 42, 1166–1202.

  • Guo, J., Levina, E., Michailidis, G. and Zhu, J. (2011). Joint estimation of multiple graphical models. Biometrika 98, 1–15.

  • Hastie, T. and Tibshirani, R. (1993). Varying-coefficient models. J. R. Stat. Soc. Series B 55, 757–779.

  • He, H., Cao, S., Zhang, J.G., Shen, H., Wang, Y.P. and Deng, H. (2019). A statistical test for differential network analysis based on inference of Gaussian graphical model. Scientif. Rep. 9, 1–8.

  • Honda, T. (2019). The de-biased group lasso estimation for varying coefficient models. Ann. Inst. Stat. Math. 1–27.

  • Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res. 6, 695–709.

  • Hyvärinen, A. (2007). Some extensions of score matching. Comput. Stat. Data Anal. 51, 2499–2512.

  • Ideker, T. and Krogan, N.J. (2012). Differential network biology. Molecular Systems Biology 8(1).

  • Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. J. Mach. Learn. Res. 15, 2869–2909.

  • Kanehisa, M. and Goto, S. (2000). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27–30.

  • Khan, S.A., Rogers, M.A., Khurana, K.K., Meguid, M.M. and Numann, P.J. (1998). Estrogen receptor expression in benign breast epithelium and breast cancer risk. J. Natl. Cancer Inst. 90, 37–42.

  • Lin, L., Drton, M. and Shojaie, A. (2016). Estimation of high-dimensional graphical models using regularized score matching. Electron. J. Stat. 10, 806–854.

  • Liu, H., Lafferty, J. and Wasserman, L. (2009). The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. J. Mach. Learn. Res. 10, 2295–2328.

  • Lumachi, F., Brunello, A., Maruzzo, M., Basso, U. and Mm Basso, S. (2013). Treatment of estrogen receptor-positive breast cancer. Curr. Med. Chem. 20, 596–604.

  • Maathuis, M., Drton, M., Lauritzen, S. and Wainwright, M. (2018). Handbook of graphical models. CRC Press, Boca Raton.

  • Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Stat. 34, 1436–1462.

  • Mitra, R. and Zhang, C.H. (2016). The benefit of group sparsity in group inference with de-biased scaled group lasso. Electron. J. Stat. 10, 1829–1873.

  • Negahban, S.N., Ravikumar, P., Wainwright, M.J. and Yu, B. (2012). A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. Stat. Sci. 27, 538–557.

  • Newman, M.E. (2003). The structure and function of complex networks. SIAM Rev. 45, 167–256.

  • Saegusa, T. and Shojaie, A. (2016). Joint estimation of precision matrices in heterogeneous populations. Electron. J. Stat. 10, 1341–1392.

  • Shojaie, A. (2020). Differential network analysis: A statistical perspective. Wiley Interdisciplinary Reviews: Computational Statistics e1508.

  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Series B 58, 267–288.

  • van der Vaart, A.W. (2000). Asymptotic statistics, 3. Cambridge University Press, Cambridge.

  • Wang, H. and Xia, Y. (2009). Shrinkage estimation of the varying coefficient model. J. Am. Stat. Assoc. 104, 747–757.

  • Wang, J. and Kolar, M. (2014). Inference for sparse conditional precision matrices. arXiv:1412.7638.

  • Weinstein, J.N., Collisson, E.A., Mills, G.B., Shaw, K.R.M., Ozenberger, B.A., Ellrott, K., Shmulevich, I., Sander, C. and Stuart, J.M. (2013). The cancer genome atlas pan-cancer analysis project. Nat. Genet. 45, 1113–1120.

  • Xia, Y., Cai, T. and Cai, T.T. (2015). Testing differential networks with applications to the detection of gene-gene interactions. Biometrika 102, 247–266.

  • Xia, Y., Cai, T. and Cai, T.T. (2018). Two-sample tests for high-dimensional linear regression with an application to detecting interactions. Stat. Sin. 28, 63–92.

  • Yang, E., Ravikumar, P., Allen, G.I. and Liu, Z. (2015). Graphical models via univariate exponential family distributions. J. Mach. Learn. Res. 16, 3813–3847.

  • Yang, J., Huang, T., Petralia, F., Long, Q., Zhang, B., Argmann, C., Zhao, Y., Mobbs, C.V., Schadt, E.E., Zhu, J. et al. (2015). Synchronized age-related gene expression changes across multiple tissues in human and the link to complex diseases. Sci. Rep. 5, 1–16.

  • Yang, Y. and Zou, H. (2015). A fast unified algorithm for solving group-lasso penalize learning problems. Stat. Comput. 25, 1129–1141.

  • Yu, M., Gupta, V. and Kolar, M. (2020). Simultaneous inference for pairwise graphical models with generalized score matching. J. Mach. Learn. Res. 21, 1–51.

  • Yu, S., Drton, M. and Shojaie, A. (2019). Generalized score matching for non-negative data. J. Mach. Learn. Res. 20, 1–70.

  • Yu, S., Drton, M. and Shojaie, A. (2021). Generalized score matching for general domains. Information and inference: A Journal of the IMA.

  • Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Series B 68, 49–67.

  • Zhang, C.H. and Zhang, S.S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. Series B 76, 217–242.

  • Zhang, X. and Cheng, G. (2017). Simultaneous inference for high-dimensional linear models. J. Am. Stat. Assoc. 112, 757–768.

  • Zhao, S.D., Cai, T.T. and Li, H. (2014). Direct estimation of differential networks. Biometrika 101, 253–268.

  • Zhou, S., Lafferty, J. and Wasserman, L. (2010). Time varying undirected graphs. Mach. Learn. 80, 295–319.

  • Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Series B 67, 301–320.

Funding

The authors gratefully acknowledge the support of the NSF Graduate Research Fellowship Program under grant DGE-1762114 as well as NSF grant DMS-1561814 and NIH grant R01-GM114029. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Aaron Hudson.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: De-biased Group LASSO Estimator

In this appendix, we derive a de-biased group LASSO estimator. Our construction is essentially the same as the one presented in van de Geer (2016).

With \(\mathcal {V}_{j}\) as defined in Eq. 11, let \(\mathcal {V}_{-j}^{g} = \left (\mathcal {V}^{g}_{1},\ldots ,\mathcal {V}^{g}_{j-1}, \mathcal {V}^{g}_{j+1},\ldots , \mathcal {V}^{g}_{p}\right )\) be an n × (p − 1)d dimensional matrix. Let \(\boldsymbol {\alpha }_{j} = \left (\alpha _{j,1}^{\top }, \ldots , \alpha _{j,p}^{\top }\right )^{\top }\), let \(\mathcal {P}_{j}\left (\boldsymbol {\alpha }_{j} \right ) = {\sum }_{k \neq j} \left \| \alpha _{j,k} \right \|_{2}\), and let \(\nabla \mathcal {P}_{j}\) denote the sub-gradient of \(\mathcal {P}_{j}\). We can express the sub-gradient as \(\nabla \mathcal {P}_{j}(\boldsymbol {\alpha }_{j}) = \left ((\nabla \|\alpha _{j,1}\|_{2})^{\top }, \ldots , (\nabla \|\alpha _{j,p}\|_{2})^{\top } \right )^{\top }\), where \(\nabla \|\alpha _{j,k}\|_{2} = \alpha _{j,k}/\|\alpha _{j,k}\|_{2}\) if \(\|\alpha _{j,k}\|_{2} \neq 0\), and \(\nabla \|\alpha _{j,k}\|_{2}\) is otherwise a vector with \(\ell _{2}\) norm at most one. The KKT conditions for the group LASSO imply that the estimate \(\tilde {\boldsymbol {\alpha }}^{g}_{j}\) satisfies

$$ \left( n^{g}\right)^{-1}\left( \mathcal{V}_{-j}^{g}\right)^{\top} \left( \mathbf{X}_{j}^{g} - \mathcal{V}_{-j}^{g} \tilde{\boldsymbol{\alpha}}^{g}_{j} \right) = -\lambda \nabla \mathcal{P}_{j}\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} \right). $$

With some algebra, we can rewrite this as

$$ \left( n^{g}\right)^{-1}\left( \mathcal{V}_{-j}^{g}\right)^{\top} \mathcal{V}_{-j}^{g} \left( \tilde{\boldsymbol{\alpha}}_{j}^{g}- \boldsymbol{\alpha}^{g,*}_{j}\right) = -\lambda \nabla \mathcal{P}_{j}\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} \right) + \left( \mathcal{V}^{g}_{-j}\right)^{\top} \left( \mathbf{X}_{j}^{g} - \mathcal{V}_{-j}^{g} \boldsymbol{\alpha}^{g,*}_{j} \right). $$

Let \({{\varSigma }}_{j}\) denote the population analogue of the scaled Gram matrix \(\left (n^{g}\right )^{-1}\left (\mathcal {V}_{-j}^{g}\right )^{\top } \mathcal {V}_{-j}^{g}\), and let \(\tilde {M}_{j}\) be an estimate of \({{\varSigma }}_{j}^{-1}\). We can write \(\left (\tilde {\boldsymbol {\alpha }}^{g}_{j} - \boldsymbol {\alpha }^{g,*}_{j}\right )\) as

$$ \begin{array}{@{}rcl@{}} \left( \tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}_{j}^{g,*} \right) &= &\underset{\mathrm{(i)}}{\underbrace{-\lambda \tilde{M}_{j}\nabla \mathcal{P}_{j}\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} \right)}} + \underset{\mathrm{(ii)}}{\underbrace{\left( n^{g}\right)^{-1}\tilde{M}_{j}\left( \mathcal{V}^{g}_{-j}\right)^{\top} \left( \mathbf{X}_{j}^{g} - \mathcal{V}_{-j}^{g} \boldsymbol{\alpha}^{g,*}_{j} \right)}} + \\ &&\underset{\mathrm{(iii)}}{\underbrace{\left\{I - \left( n^{g}\right)^{-1}\tilde{M}_{j}\left( \mathcal{V}_{-j}^{g}\right)^{\top} \mathcal{V}_{-j}^{g} \right\} \left( \tilde{\boldsymbol{\alpha}}_{j}^{g} - \boldsymbol{\alpha}^{g,*}_{j}\right)}}. \end{array} $$
(A.1)

The first term (i) in Eq. A.1 approximates the bias of the group LASSO estimate. Because this term is a function only of the observed data and not of any unknown quantities, it can be removed from the initial estimate \(\tilde {\boldsymbol {\alpha }}_{j}^{g}\) directly. If \(\tilde {M}_{j}\) is a consistent estimate of \({{\varSigma }}_{j}^{-1}\), the second term (ii) is asymptotically equivalent to

$$ {{\varSigma}}^{-1}_{j} \left( \mathcal{V}^{g}_{-j}\right)^{\top} \left( \mathbf{X}_{j}^{g} - \mathcal{V}_{-j}^{g} \boldsymbol{\alpha}^{g,*}_{j} \right). $$

Thus, (ii) is asymptotically equivalent to a sample average of mean zero i.i.d. random variables. The central limit theorem can then be applied to establish convergence in distribution to the multivariate normal distribution at an n1/2 rate for any low-dimensional sub-vector. The third term will also be asymptotically negligible if \(\tilde {M}_{j}\) is an approximate inverse of \((n^{g})^{-1}\left (\mathcal {V}_{-j}^{g}\right )^{\top }\mathcal {V}^{g}_{-j}\). This would suggest that an estimator of the form

$$ \check{\boldsymbol{\alpha}}_{j}^{g} = \tilde{\boldsymbol{\alpha}}_{j}^{g} + \lambda \tilde{M}_{j} \nabla \mathcal{P}_{j}\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} \right) $$

will be asymptotically normal for an appropriate choice of \(\tilde {M}_{j}\).
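
In practice, the correction term can be computed from the group LASSO residuals: by the KKT conditions, \(\lambda \nabla \mathcal {P}_{j}(\tilde {\boldsymbol {\alpha }}^{g}_{j})\) equals, up to sign, the scaled residual correlation \(\left (n^{g}\right )^{-1}\left (\mathcal {V}^{g}_{-j}\right )^{\top }\left (\mathbf {X}^{g}_{j} - \mathcal {V}^{g}_{-j}\tilde {\boldsymbol {\alpha }}^{g}_{j}\right )\), and adding \(\left (n^{g}\right )^{-1}\tilde {M}_{j}\left (\mathcal {V}^{g}_{-j}\right )^{\top }\left (\mathbf {X}^{g}_{j} - \mathcal {V}^{g}_{-j}\tilde {\boldsymbol {\alpha }}^{g}_{j}\right )\) to the initial estimate reproduces the explicit blockwise form given in Eq. A.5 below. A minimal numpy sketch of this residual-based one-step correction (variable names illustrative; \(\tilde {M}_{j}\) assumed already constructed as described next) is:

import numpy as np

def debias_group_lasso(alpha_tilde, X_j, V_minus_j, M_tilde, n_g):
    """Residual-based one-step correction of the group LASSO estimate for
    node j: alpha_tilde + (1/n_g) * M_tilde @ V_{-j}^T (X_j - V_{-j} alpha_tilde),
    where M_tilde approximates the inverse of (1/n_g) V_{-j}^T V_{-j}."""
    residual = X_j - V_minus_j @ alpha_tilde
    return alpha_tilde + M_tilde @ (V_minus_j.T @ residual) / n_g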

Before describing our construction of \(\tilde {M}_{j}\), we find it helpful to consider an alternative expression for \({{\varSigma }}^{-1}_{j}\). We define the d × d matrices \({{\varGamma }}^{*}_{j,k,l}\) as

(A.2)

We also define the d × d matrix \({C}^{*}_{j,k}\) as

It can be shown that \({{\varSigma }}^{-1}_{j}\) can be expressed as

$$ {{\varSigma}}^{-1}_{j} = \begin{pmatrix} \left( C_{j,1}^{*}\right)^{-1} & {\cdots} & \mathbf{0} \\ {\vdots} & {\ddots} & \vdots \\ \mathbf{0} & {\cdots} & \left( C_{j,p}^{*}\right)^{-1} \end{pmatrix} \begin{pmatrix} I & -{{\varGamma}}^{*}_{j,1,2} & {\cdots} & -{{\varGamma}}^{*}_{j,1,p} \\ -{{\varGamma}}^{*}_{j,2,1} & I & {\cdots} & -{{\varGamma}}^{*}_{j,2,p} \\ {\vdots} & {\vdots} & {\ddots} & \vdots \\ -{{\varGamma}}^{*}_{j,p,1} & -{{\varGamma}}^{*}_{j,p,2} & {\cdots} & I \end{pmatrix} . $$

We can thus estimate \({{\varSigma }}_{j}^{-1}\) by performing a series of regressions to estimate each matrix \({{\varGamma }}^{*}_{j,k,l}\).

Following the approach of van de Geer et al. (2014), we use a group LASSO variant of the nodewise LASSO to construct \(\tilde {M}_{j}\). To proceed, we require some additional notation. For any d × d matrix Γ = (γ1,…,γd) with d-dimensional columns γc, let \(\|{{\varGamma }} \|_{2,*} = {\sum }_{c = 1}^{d} \|\gamma _{c}\|_{2}\), and let \(\nabla \| {{\varGamma }} \|_{2,*} = \left (\gamma _{1}/\|\gamma _{1}\|_{2},\ldots ,\gamma _{d}/ \|\gamma _{d}\|_{2} \right )\) be the subgradient of \(\|{{\varGamma }} \|_{2,*}\). We use the group LASSO to obtain estimates \(\tilde {{{\varGamma }}}_{j,k,l}\) of \({{\varGamma }}^{*}_{j,k,l}\):

(A.3)

We then estimate \(C^{*}_{j,k}\) as

$$ \tilde{C}_{j,k} = \left( n^{g}\right)^{-1} \left( \mathcal{V}^{g}_{k} - {\sum}_{l \neq k,j} \mathcal{V}^{g}_{l} \tilde{{{\varGamma}}}_{j,k,l} \right)^{\top}\left( \mathcal{V}_{k}^{g}\right). $$

Our estimate \(\tilde {M}_{j}\) takes the form

$$ \tilde{M}_{j} = \begin{pmatrix} \tilde{C}^{-1}_{j,1} & {\cdots} & \mathbf{0} \\ {\vdots} & {\ddots} & \vdots \\ \mathbf{0} & {\cdots} & \tilde{C}^{-1}_{j,p} \end{pmatrix} \begin{pmatrix} I & -\tilde{{{\varGamma}}}_{j,1,2} & {\cdots} & -\tilde{{{\varGamma}}}_{j,1,p} \\ -\tilde{{{\varGamma}}}_{j,2,1} & I & {\cdots} & -\tilde{{{\varGamma}}}_{j,2,p} \\ {\vdots} & {\vdots} & {\ddots} & \vdots \\ -\tilde{{{\varGamma}}}_{j,p,1} & -\tilde{{{\varGamma}}}_{j,p,2} & {\cdots} & I \end{pmatrix} . $$
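
Operationally, constructing \(\tilde {M}_{j}\) amounts to a loop over the remaining nodes: regress each block \(\mathcal {V}^{g}_{k}\) on the other blocks with the group LASSO to obtain the \(\tilde {{{\varGamma }}}_{j,k,l}\), form \(\tilde {C}_{j,k}\), and assemble the two block matrices above. The numpy sketch below covers only the assembly step; it assumes the nodewise group LASSO fits \(\tilde {{{\varGamma }}}_{j,k,l}\) have already been computed with some group LASSO solver (not shown), and the variable names are illustrative.

import numpy as np

def build_M_tilde(V_blocks, Gamma_tilde, n_g):
    """Assemble M_tilde_j from nodewise group LASSO fits.  V_blocks[k] is the
    n x d matrix V^g_k (k != j, in local indexing), and Gamma_tilde[(k, l)]
    is the d x d estimate of Gamma*_{j,k,l}.  Follows the block formulas for
    C_tilde_{j,k} and M_tilde_j displayed above."""
    m, d = len(V_blocks), V_blocks[0].shape[1]
    M = np.zeros((m * d, m * d))
    for k in range(m):
        fitted = sum(V_blocks[l] @ Gamma_tilde[(k, l)] for l in range(m) if l != k)
        C_k = (V_blocks[k] - fitted).T @ V_blocks[k] / n_g
        C_k_inv = np.linalg.inv(C_k)
        for l in range(m):
            block = np.eye(d) if l == k else -Gamma_tilde[(k, l)]
            M[k * d:(k + 1) * d, l * d:(l + 1) * d] = C_k_inv @ block
    return M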

With this construction of \(\tilde {M}_{j}\), we can establish a bound on the remainder term (iii) in Eq. A.1. To show this, we make use of the following lemma, which states a special case of the dual norm inequality for the group LASSO norm \(\mathcal {P}_{j}\) (see, e.g., Chapter 6 of van de Geer (2016)).

Lemma 1.

Let a1,…,ap and b1,…,bp be d-dimensional vectors, and let \(\mathbf {a} = \left (a_{1}^{\top },\ldots ,a_{p}^{\top }\right )^{\top }\) and \(\mathbf {b} = \left (b_{1}^{\top },\dots ,b_{p}^{\top }\right )^{\top }\) be pd-dimensional vectors. Then

$$ \langle \mathbf{a}, \mathbf{b}\rangle \leq \left( {\sum}_{j=1}^{p} \|a_{j}\|_{2} \right) \max_{j} \left\| b_{j} \right\|_{2}. $$

The KKT conditions for Eq. A.3 imply that for all \(l \neq j,k\)

$$ \left( n^{g}\right)^{-1}\left( \mathcal{V}^{g}_{l}\right)^{\top}\left( \mathcal{V}^{g}_{k} - {\sum}_{r \neq k,j} \mathcal{V}^{g}_{r} \tilde{{{\varGamma}}}_{j,k,r}\right) = -\omega \nabla \left\| \tilde{{{\varGamma}}}_{j,k,l} \right\|_{2,*}. $$
(A.4)

Lemma 1 and Eq. A.4 imply that

$$ \left\| \begin{pmatrix} \tilde{C}_{j,1} & {\cdots} & \mathbf{0} \\ {\vdots} & {\ddots} & \vdots \\ \mathbf{0} & {\cdots} & \tilde{C}_{j,p} \end{pmatrix} \left\{I - \left( n^{g}\right)^{-1}\tilde{M}_{j}\left( \mathcal{V}_{-j}^{g}\right)^{\top} \mathcal{V}_{-j}^{g} \right\} \left( \tilde{\boldsymbol{\alpha}}_{j}^{g} - \boldsymbol{\alpha}^{g,*}_{j}\right) \right\|_{\infty} \leq \omega \mathcal{P}_{j}\left( \tilde{\boldsymbol{\alpha}}_{j}^{g} - \boldsymbol{\alpha}^{g,*}_{j}\right), $$

where \(\|\cdot \|_{\infty }\) is the \(\ell _{\infty }\) norm. With \(\omega \asymp \left \{\log (p)/n\right \}^{1/2}\), \(\tilde {M}_{j}\) can be shown to be consistent under sparsity of \({{\varGamma }}^{*}_{j,k,l}\) (i.e., only a few matrices \({{\varGamma }}^{*}_{j,k,l}\) have some nonzero columns) and some additional regularity conditions. Additionally, it can be shown under sparsity of αg,∗ (i.e., very few vectors \(\alpha ^{g,*}_{j,k}\) are nonzero) and some additional regularity conditions that \(\mathcal {P}_{j}\left (\tilde {\boldsymbol {\alpha }}_{j}^{g} - \boldsymbol {\alpha }_{j}^{g,*} \right ) = O_{P}\left (\left \{\log (p)/n \right \}^{1/2}\right )\). Thus, a scaled version of the remainder term (iii) is oP(n− 1/2) if \(n^{-1/2}\log (p) \to 0\). We refer readers to Chapter 8 of Bühlmann and van de Geer (2011) for a more comprehensive discussion of assumptions required for consistency of the group LASSO.

We now express the de-biased group LASSO estimator for \(\alpha ^{g,*}_{j,k}\) as

$$ \check{\alpha}^{g}_{j,k} = \tilde{\alpha}^{g}_{j,k} + \left( n^{g}\right)^{-1} \tilde{C}^{-1}_{j,k} \left( \mathcal{V}^{g}_{k} - {\sum}_{l \neq j, k} \tilde{{{\varGamma}}}_{j,k,l} \mathcal{V}_{l}^{g} \right)^{\top} \left( \mathbf{X}^{g}_{j} - \mathcal{V}^{g}_{-j} \tilde{\boldsymbol{\alpha}}^{g}_{j} \right). $$
(A.5)

We have established that \(\check {\alpha }^{g}_{j,k}\) can be written as

$$ \tilde{C}_{j,k} \left( \check{\alpha}^{g}_{j,k} - \alpha^{g,*}_{j,k}\right) = \left( n^{g}\right)^{-1} \left( \mathcal{V}^{g}_{k} - {\sum}_{l \neq j, k} {{\varGamma}}^{*}_{j,k,l} \mathcal{V}_{l}^{g} \right)^{\top} \left( \mathbf{X}^{g}_{j} - \mathcal{V}^{g}_{-j} \boldsymbol{\alpha}^{g,*}_{j} \right) + o_{P}(n^{-1/2}). $$

As stated above, the central limit theorem implies asymptotic normality of \(\check {\alpha }^{g}_{j,k}\).

We now construct an estimate for the variance of \(\check {\alpha }^{g}_{j,k}\). Suppose the residual \(\mathbf {X}^{g}_{j} - \mathcal {V}^{g}_{-j} \boldsymbol {\alpha }^{g,*}_{j}\) is independent of \(\mathcal {V}^{g}\), and let \({\tau _{j}^{g}}\) denote the variance of each element of this residual.

We can approximate the variance of \(\check {\alpha }^{g}_{j,k}\) as

$$ \check{{{\varOmega}}}^{g}_{j,k} = \left( n^{g}\right)^{-2}{\tau_{j}^{g}} \tilde{C}^{-1}_{j,k} \left( \mathcal{V}^{g}_{k} - {\sum}_{l \neq j, k} \tilde{{{\varGamma}}}_{j,k,l} \mathcal{V}_{l}^{g} \right)^{\top} \left( \mathcal{V}^{g}_{k} - {\sum}_{l \neq j, k} \tilde{{{\varGamma}}}_{j,k,l} \mathcal{V}_{l}^{g} \right) \left( \tilde{C}^{-1}_{j,k}\right)^{\top}. $$
(A.6)

As \({\tau _{j}^{g}}\) is typically unknown, we instead use the estimate

$$ \tilde{\tau}_{j}^{g} = \frac{\left\| \mathbf{X}^{g}_{j} - \mathcal{V}^{g}_{-j} \tilde{\boldsymbol{\alpha}}^{g}_{j} \right\|_{2}^{2}}{n - \widehat{df}}, $$

where \(\widehat {df}\) is an estimate of the degrees of freedom for the group LASSO estimate \(\tilde {\boldsymbol {\alpha }}_{j}^{g}\). In our implementation, we use the estimate proposed by Breheny and Huang (2009). Let \(\tilde {\alpha }^g_{j,k,l}\) be the l-th element of \(\tilde {\alpha }^g_{j,k}\), and let \(\mathcal {V}^g_{k,l}\) denote the l-th column of \(\mathcal {V}^g_k\). We then define

$$ \begin{array}{@{}rcl@{}} \bar{\alpha}^g_{j,k,l} = \frac{\langle \mathbf{X}^g_{j} - \mathcal{V}^g_{-j}\tilde{\boldsymbol{\alpha}}^g_j + \mathcal{V}^g_{k,l}\tilde{\alpha}^g_{j,k,l}, \mathcal{V}^g_{k,l}\rangle }{\langle \mathcal{V}^g_{k,l} , \mathcal{V}^g_{k,l} \rangle}, \end{array} $$

and estimate the degrees of freedom as

$$ \begin{array}{@{}rcl@{}} \hat{df} = {\sum}_{k \neq j}{\sum}_{l=1}^{d} \frac{\tilde{\alpha}^{g}_{j,k,l}}{\bar{\alpha}^{g}_{j,k,l}}. \end{array} $$
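
For concreteness, the variance and degrees-of-freedom estimates above can be computed with a few lines of numpy, as in the following sketch (variable names are illustrative; alpha_tilde stacks the group LASSO blocks \(\tilde {\alpha }^{g}_{j,k}\) in the same order as the columns of \(\mathcal {V}^{g}_{-j}\)):

import numpy as np

def residual_variance_estimate(X_j, V_minus_j, alpha_tilde):
    """Estimate tau_j^g as the residual sum of squares divided by n - df_hat,
    with df_hat computed from the ratios alpha_tilde / alpha_bar as above."""
    n = V_minus_j.shape[0]
    residual = X_j - V_minus_j @ alpha_tilde
    df_hat = 0.0
    for c in range(V_minus_j.shape[1]):
        if alpha_tilde[c] == 0.0:
            continue  # zero coefficients contribute nothing to df_hat
        v_c = V_minus_j[:, c]
        partial = residual + v_c * alpha_tilde[c]   # add the c-th column back in
        alpha_bar = (partial @ v_c) / (v_c @ v_c)   # univariate LS coefficient
        df_hat += alpha_tilde[c] / alpha_bar
    tau_tilde = (residual @ residual) / (n - df_hat)
    return tau_tilde, df_hat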

Appendix B: Generalized Score Matching Estimator

In this section, we establish consistency of the regularized score matching estimator and derive a bias-corrected estimator.

B.1 Form of Generalized Score Matching Loss

Below, we restate Theorem 3 of Yu et al. (2019), which provides conditions under which the score matching loss in Eq. 20 can be expressed as Eq. 21.

Theorem 1.

Assume the following conditions hold:

where the prime symbol denotes the element-wise derivative. Then Eqs. 20 and 21 are equivalent up to an additive constant that does not depend on h.

B.2 Generalized Score Matching Estimator in Low Dimensions

In this section, we provide an explicit form for the generalized score matching estimator in the low-dimensional setting and state its limiting distribution. We first introduce some additional notation below that allows for the generalized score matching loss to be written in a condensed form. Recall the form of the conditional density for the pairwise interaction model in Eq. 22. We define

$$ \begin{array}{@{}rcl@{}} &&\!\!\!\!\!\mathcal{V}^{g}_{j,k,1} = \begin{pmatrix} v_{j}^{1/2}\left( X^{g}_{1,j}\right)\dot{\psi}\left( X^{g}_{1,j}, X^{g}_{1,k}\right) \times \phi\left( {W^{g}_{1}}\right) \\ \vdots \\ v_{j}^{1/2}\left( X^{g}_{n^{g},j}\right) \dot{\psi}\left( X^{g}_{n^{g},j}, X^{g}_{n^{g},k}\right) \times \phi\left( W^{g}_{n^{g}}\right) \end{pmatrix}, \\ \\ &&\!\!\!\!\!\mathcal{V}^{g}_{2,j} = \begin{pmatrix} v_{j}^{1/2}\left( X^{g}_{1,j}\right) \times \left\{ \dot{\zeta}\left( X^{g}_{1,j}, \phi_{1}({W^{g}_{1}})\right),\cdots,\dot{\zeta}\left( X^{g}_{1,j}, \phi_{d}({W^{g}_{1}})\right) \right\} \\ \vdots \\ v_{j}^{1/2}\left( X^{g}_{n^{g},j}\right) \times \left\{ \dot{\zeta}\left( X^{g}_{n^{g},j}, \phi_{1}(W^{g}_{n^{g}})\right),\cdots,\dot{\zeta}\left( X^{g}_{n^{g},j}, \phi_{d}(W^{g}_{n^{g}})\right) \right\} \end{pmatrix},\\\\ &&\!\!\!\!\!\mathcal{U}^{g}_{j,k,1} = \begin{pmatrix} \left\{\dot{v}_{j}\left( X^{g}_{1,j}\right)\dot{\psi}\left( X^{g}_{1,j}, X^{g}_{1,k}\right) + v_{j}\left( X^{g}_{1,j}\right)\ddot{\psi}\left( X^{g}_{1,j}, X^{g}_{1,k}\right) \right\} \times \phi\left( {W^{g}_{1}}\right) \\ \vdots \\ \left\{\dot{v}_{j}\left( X^{g}_{1,j}\right)\dot{\psi}\left( X^{g}_{n^{g},j}, X^{g}_{n^{g},k}\right) + v_{j}\left( X^{g}_{n^{g},j}\right)\ddot{\psi}\left( X^{g}_{1,j}, X^{g}_{n^{g},k}\right) \right\} \times \phi\left( W^{g}_{n^{g}}\right) \end{pmatrix}, \\ \\ &&\!\!\!\!\!\mathcal{U}^{g}_{j,2} = \begin{pmatrix} v_{j}\left( X_{1,j}^{g}\right) \ddot{\zeta}\left( X^{g}_{1,j}, \phi_{1}({W^{g}_{1}})\right) & {\cdots} & v_{j}\left( X_{1,j}^{g}\right) \ddot{\zeta}\left( X^{g}_{1,j}, \phi_{d}({W^{g}_{1}})\right) \\ {\vdots} & {\ddots} & \vdots \\ v_{j}\left( X_{n^{g},j}^{g}\right) \ddot{\zeta}\left( X^{g}_{n^{g},j}, \phi_{1}(W^{g}_{n^{g}})\right) & {\cdots} & v_{j}\left( X_{n^{g},j}^{g}\right) \ddot{\zeta}\left( X^{g}_{n^{g},j}, \phi_{d}(W^{g}_{n^{g}})\right) \end{pmatrix} \\\\ &&\quad\quad\quad +\begin{pmatrix} \dot{v}_{j}\left( X_{1,j}^{g}\right) \dot{\zeta}\left( X^{g}_{1,j}, \phi_{1}({W^{g}_{1}})\right) & {\cdots} & \dot{v}_{j}\left( X_{1,j}^{g}\right) \dot{\zeta}\left( X^{g}_{1,j}, \phi_{d}({W^{g}_{1}})\right) \\ {\vdots} & {\ddots} & \vdots \\\ \dot{v}_{j}\left( X_{n^{g},j}^{g}\right) \dot{\zeta}\left( X^{g}_{n^{g},j}, \phi_{1}(W^{g}_{n^{g}})\right) & {\cdots} & \dot{v}_{j}\left( X_{n^{g},j}^{g}\right) \dot{\zeta}\left( X^{g}_{n^{g},j}, \phi_{d}(W^{g}_{n^{g}})\right)\!\! \end{pmatrix}\!, \\ \\ &&\!\!\!\!\!\mathcal{V}^{g}_{j,1} = \begin{pmatrix} \mathcal{V}^{g}_{j,1,1} \\ {\vdots} \\ \mathcal{V}^{g}_{j,p,1} \end{pmatrix}; \quad \mathcal{U}^{g}_{j,1} = \begin{pmatrix} \mathcal{U}^{g}_{1,j,1} \\ {\vdots} \\ \mathcal{U}^{g}_{j,p,1} \end{pmatrix}. \end{array} $$

Let \(\boldsymbol {\alpha }_{j} = \left (\alpha _{j,1}^{\top }, \ldots ,\alpha _{j,p}^{\top }\right )^{\top }\) and \(\boldsymbol {\theta }_{j} = \left (\theta _{j,1},\ldots ,\theta _{j,d}\right )\). We can express the empirical score matching loss in Eq. 23 as

$$ L^{g}_{n,j}(\boldsymbol{\alpha}_{j}, \boldsymbol{\theta}_{j}) = \left( 2n^{g}\right)^{-1} \left( \mathcal{V}_{j,1}^{g} \boldsymbol{\alpha}_{j} + \mathcal{V}^{g}_{2,j} \boldsymbol{\theta}_{j} \right)^{\top} \left( \mathcal{V}_{j,1}^{g} \boldsymbol{\alpha}_{j}+ \mathcal{V}^{g}_{2,j} \boldsymbol{\theta}_{j} \right) + \left( n^{g}\right)^{-1}\mathbf{1}^{\top} \left( \mathcal{U}^{g}_{1,j} \boldsymbol{\alpha}_{j} + \mathcal{U}^{g}_{2,j} \boldsymbol{\theta}_{j} \right). $$

We write the gradient of the risk function as

$$ \nabla L^{g}_{n,j}(\boldsymbol{\alpha}_{j}, \boldsymbol{\theta}_{j}) = \left( n^{g}\right)^{-1} \begin{pmatrix} \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \\ \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \end{pmatrix} \begin{pmatrix} \boldsymbol{\alpha}_{j} \\ \boldsymbol{\theta}_{j} \end{pmatrix} + \left( n^{g}\right)^{-1} \begin{pmatrix} \left( \mathcal{U}_{j,1}^{g}\right)^{\top}\mathbf{1} \\ \left( \mathcal{U}_{j,2}^{g}\right)^{\top}\mathbf{1} \end{pmatrix}. $$

Thus, the minimizer \((\hat {\boldsymbol {\alpha }}^{g}_{j}, \hat {\boldsymbol {\theta }}^{g}_{j})\) of the empirical loss takes the form

$$ \begin{pmatrix} \hat{\boldsymbol{\alpha}}^{g}_{j} \\ \hat{\boldsymbol{\theta}}^{g}_{j} \end{pmatrix} = - \begin{pmatrix} \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \\ \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \end{pmatrix}^{-1} \begin{pmatrix} \left( \mathcal{U}_{j,1}^{g}\right)^{\top}\mathbf{1} \\ \left( \mathcal{U}_{j,2}^{g}\right)^{\top}\mathbf{1} \end{pmatrix}. $$

By applying Theorem 5.23 of van der Vaart (2000),

$$ \left( n^{g}\right)^{1/2} \begin{pmatrix} \hat{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}_{j}^{g,*} \\ \hat{\boldsymbol{\theta}}^{g}_{j} - \boldsymbol{\theta}_{j}^{g,*} \end{pmatrix} \to_{d} N\left( 0, A B A \right), $$

where the matrices A and B are defined as

We estimate the variance of \((\hat {\boldsymbol {\alpha }}^{g}_{j}, \hat {\boldsymbol {\theta }}^{g}_{j})\) as \(\hat {{{\varOmega }}}^{g}_{j} = \left (n^{g}\right )^{-1}\hat {A} \hat {B} \hat {A}\), where

$$ \begin{array}{@{}rcl@{}} &&\hat{A} = n^{g} \begin{pmatrix} \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \\ \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \end{pmatrix}^{-1}, \\ &&\hat{B} = \left( n^{g}\right)^{-1}\hat{\xi}^{\top}\hat{\xi}, \quad \hat{\xi} = \begin{pmatrix} \text{diag}\left( \mathcal{V}_{j,1}^{g}\hat{\boldsymbol{\alpha}}^{g}_{j} + \mathcal{V}_{j,2}^{g} \hat{\boldsymbol{\theta}}^{g}_{j} \right)\mathcal{V}_{j,1}^{g} \\ \text{diag}\left( \mathcal{V}_{j,1}^{g}\hat{\boldsymbol{\alpha}}^{g}_{j} + \mathcal{V}_{j,2}^{g} \hat{\boldsymbol{\theta}}^{g}_{j} \right) \mathcal{V}_{j,2}^{g} \end{pmatrix} + \begin{pmatrix} \mathcal{U}_{j,1}^{g} \\ \mathcal{U}_{j,2}^{g} \end{pmatrix}. \end{array} $$
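
Because the low-dimensional estimator and its sandwich variance are available in closed form, they can be computed with a few matrix operations. The numpy sketch below assumes that \(\mathcal {V}^{g}_{j,1}\) and \(\mathcal {U}^{g}_{j,1}\) are arranged as matrices with one row per observation and column blocks \(\mathcal {V}^{g}_{j,k,1}\) and \(\mathcal {U}^{g}_{j,k,1}\), respectively, and that the rows of \(\hat {\xi }\) are the per-observation gradient contributions; variable names are illustrative.

import numpy as np

def score_matching_fit(V1, V2, U1, U2):
    """Closed-form minimizer of the empirical generalized score matching loss
    in the low-dimensional case, with the sandwich variance estimate
    Omega_hat = A_hat @ B_hat @ A_hat / n."""
    n = V1.shape[0]
    V = np.hstack([V1, V2])                             # [V_{j,1}, V_{j,2}]
    U = np.hstack([U1, U2])                             # [U_{j,1}, U_{j,2}]
    gram = V.T @ V
    estimate = -np.linalg.solve(gram, U.sum(axis=0))    # (alpha_hat, theta_hat)
    A_hat = n * np.linalg.inv(gram)
    xi_hat = V * (V @ estimate)[:, None] + U            # per-observation gradients
    B_hat = xi_hat.T @ xi_hat / n
    Omega_hat = A_hat @ B_hat @ A_hat / n
    return estimate, Omega_hat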

B.3 Consistency of Regularized Generalized Score Matching Estimator

In this subsection, we argue that the regularized generalized score matching estimators \(\tilde {\boldsymbol {\alpha }}^{g}_{j}\) and \(\tilde {\boldsymbol {\theta }}^{g}_{j}\) from Eq. 24 are consistent. Let \(\mathcal {P}_{j}(\boldsymbol {\alpha }_{j}) = {\sum }_{k \neq j} \|\alpha _{j,k}\|_{2}\). We establish convergence rates of \(\mathcal {P}_{j}\left (\tilde {\boldsymbol {\alpha }}_{j}^{g} - \boldsymbol {\alpha }_{j}^{g,*} \right )\) and \(\left \|\tilde {\boldsymbol {\theta }}^{g}_{j} - \boldsymbol {\theta }_{j}^{g,*} \right \|_{2}\). Our approach is based on proof techniques described in Bühlmann and van de Geer (2011).

Our result requires a notion of compatibility between the penalty function \(\mathcal {P}_{j}\) and the loss \(L^{g}_{n,j}\). Such notions are commonly assumed in the high-dimensional literature. Below, we define the compatibility condition.

Definition 1 (Compatibility Condition).

Let S be a set containing indices of the nonzero elements of \(\boldsymbol {\alpha }_{j}^{g,*}\), and let \(\bar {S}\) denote the complement of S. Let be a (p − 1)d-dimensional vector where the r-th element is one if rS, and zero otherwise. The group LASSO compatibility condition holds for the index set S ⊂{1,…,p} and for constant C > 0 if for all ,

where ∘ is the element-wise product operator.

Theorem 2.

Let \(\mathcal {E}\) be the set

$$ \begin{array}{@{}rcl@{}} \mathcal{E} &=& \left\{ \max_{k \neq j} \left\{ \left\| \left( \mathcal{V}_{j,k,1}^{g}\right)^{\top} \left( \mathcal{V}_{j,1}^{g} \boldsymbol{\alpha}_{j}^{g,*} + \mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*} \right) + \left( \mathcal{U}^{g}_{j,1}\right)^{\top} \mathbf{1} \right\|_{2}\right\} \leq n^{g}\lambda_{0} \right\} \cap \\ &&\left\{ \left\| \left( \mathcal{V}_{j,k,2}^{g}\right)^{\top} \left( \mathcal{V}_{j,1}^{g} \boldsymbol{\alpha}_{j}^{g,*} + \mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*} \right) + \left( \mathcal{U}^{g}_{j,2}\right)^{\top} \mathbf{1} \right\|_{2} \leq n^{g}\lambda_{0} \right\} \end{array} $$

for some \(\lambda _{0} \leq \lambda /2\). Suppose the compatibility condition also holds. Then on the set \(\mathcal {E}\),

$$ \mathcal{P}\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}^{g,*}_{j} \right) + \| \tilde{\boldsymbol{\theta}}^{g}_{j} - \boldsymbol{\theta}_{j}^{g,*} \|_{2} \leq \frac{4\lambda |S|}{C^{2}} . $$

Proof Proof of Theorem 2.

The regularized score matching estimator \(\tilde {\boldsymbol {\alpha }}_{j}^{g}\) necessarily satisfies the following basic inequality:

$$ L^{g}_{n,j}\left( \tilde{\boldsymbol{\alpha}}^{g}_{j}, \tilde{\boldsymbol{\theta}}^{g}_{j}\right) + \lambda\mathcal{P}_{j}\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} \right) \leq L^{g}_{n,j}\left( \boldsymbol{\alpha}^{g,*}_{j}, \boldsymbol{\theta}^{g,*}_{j}\right) + \lambda\mathcal{P}_{j}\left( \boldsymbol{\alpha}^{g,*}_{j} \right). $$

With some algebra, this inequality can be rewritten as

$$ \begin{array}{@{}rcl@{}} &&\!\!\!\!\!\!\!\!\!\!\!\left( 2n^{g}\right)^{-1} \begin{pmatrix} \left( \tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}^{g,*}_{j} \right)^{\top} & \left( \tilde{\boldsymbol{\theta}}^{g}_{j} - \boldsymbol{\theta}^{g,*}_{j}\right)^{\top} \end{pmatrix} \begin{pmatrix} \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \\ \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \end{pmatrix}\\ &&\times\begin{pmatrix} \tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}^{g,*}_{j} \\ \tilde{\boldsymbol{\theta}}^{g}_{j} - \boldsymbol{\theta}^{g,*}_{j} \end{pmatrix} + \lambda\mathcal{P}_{j}\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} \right) \!\leq\! -\left( n^{g}\right)^{-1} \begin{pmatrix} \left( \tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}^{g,*}_{j} \right)^{\top} & \left( \tilde{\boldsymbol{\theta}}^{g}_{j} - \boldsymbol{\theta}^{g,*}_{j}\right)^{\top} \end{pmatrix}\\ &&\times\begin{pmatrix} \left( \mathcal{V}_{j,1}^{g}\right)^{\top} \left( \mathcal{V}_{j,1}^{g} \boldsymbol{\alpha}_{j}^{g,*} + \mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*} \right) + \left( \mathcal{U}^{g}_{j,1}\right)^{\top} \mathbf{1} \\ \left( \mathcal{V}_{j,2}^{g}\right)^{\top} \left( \mathcal{V}_{j,1}^{g} \boldsymbol{\alpha}_{j}^{g,*} + \mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*} \right) + \left( \mathcal{U}^{g}_{j,2}\right)^{\top} \mathbf{1} \end{pmatrix}\ + \lambda\mathcal{P}_{j}\left( \boldsymbol{\alpha}^{g,*}_{j} \right). \end{array} $$

By Lemma 1, on the set \(\mathcal {E}\) and using \(\lambda _{0} \leq \lambda /2\), we get

$$ \begin{array}{@{}rcl@{}} && \left( n^{g}\right)^{-1} \begin{pmatrix} \left( \tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}^{g,*}_{j} \right)^{\top} & \left( \tilde{\boldsymbol{\theta}}^{g}_{j} - \boldsymbol{\theta}^{g,*}_{j}\right)^{\top} \end{pmatrix} \begin{pmatrix} \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \\ \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \end{pmatrix}\\&& \times\begin{pmatrix} \tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}^{g,*}_{j} \\ \tilde{\boldsymbol{\theta}}^{g}_{j} - \boldsymbol{\theta}^{g,*}_{j} \end{pmatrix} + 2\lambda \mathcal{P}_{j}\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} \right) \leq \lambda\left\|\tilde{\boldsymbol{\theta}}_{j} - \boldsymbol{\theta}^{*}_{j} \right\|_{2} + 2\lambda \mathcal{P}_{j}\left( \boldsymbol{\alpha}^{g,*}_{j} \right) + \lambda\mathcal{P}_{j}\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}^{g,*}_{j} \right). \end{array} $$

On the left hand side, we apply the triangle inequality to get

On the right hand side, we observe that

We then have

Now,

where we use the compatibility condition for the first inequality, and for the second inequality use the fact that

$$ ab \leq b^{2} + a^{2} $$

for any real numbers \(a\) and \(b\). The conclusion follows immediately. □

If the event \(\mathcal {E}\) occurs with probability tending to one, Theorem 2 implies

$$ \mathcal{P}\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}^{g,*}_{j} \right) + \| \tilde{\boldsymbol{\theta}}^{g}_{j} - \boldsymbol{\theta}_{j}^{g,*} \|_{2} = O_{P}\left( \lambda\right). $$

We select λ so that the event \(\mathcal {E}\) occurs with high probability. For instance, suppose the elements of the matrix

$$ \begin{array}{@{}rcl@{}} \xi = \begin{pmatrix} \text{diag}\left( \mathcal{V}_{j,1}^g\boldsymbol{\alpha}^{g,*}_j + \mathcal{V}_{j,2}^g \boldsymbol{\theta}^{g,*}_j \right)\mathcal{V}_{j,1}^g + \mathcal{U}_{j,1}^g \\ \text{diag}\left( \mathcal{V}_{j,1}^g\boldsymbol{\alpha}^{g,*}_j + \mathcal{V}_{j,2}^g \boldsymbol{\theta}^{g,*}_j \right) \mathcal{V}_{j,2}^g + \mathcal{U}_{j,2}^g \end{pmatrix} \end{array} $$

are sub-Gaussian, and consider the event

$$ \bar{\mathcal{E}} = \left\{ \left\| \begin{pmatrix} \left( \mathcal{V}_{j,1}^g\right)^{\top} \left( \mathcal{V}_{j,1}^g \boldsymbol{\alpha}_j^{g,*} +\mathcal{V}_{j,2}^g\boldsymbol{\theta}_j^{g,*} \right) + \left( \mathcal{U}^g_{j,1}\right)^{\top} \mathbf{1} \\ \left( \mathcal{V}_{j,2}^g\right)^{\top} \left( \mathcal{V}_{j,1}^g \boldsymbol{\alpha}_j^{g,*} +\mathcal{V}_{j,2}^g\boldsymbol{\theta}_j^{g,*} \right) + \left( \mathcal{U}^g_{j,2}\right)^{\top} \mathbf{1} \end{pmatrix} \right\|_{\infty} \leq \frac{n^{g}\lambda_{0}}{d} \right\}, $$

where \(\|\cdot \|_{\infty }\) is the \(\ell _{\infty }\) norm. Observing that \(\bar {\mathcal {E}} \subset \mathcal {E}\), it is only necessary to show that \(\bar {\mathcal {E}}\) holds with high probability. It is shown in Corollary 2 of Negahban et al. (2012) that there exist constants u1,u2 > 0 such that with \(\lambda _{0} \asymp \{\log (p)/n\}^{1/2}\), \(\bar {\mathcal {E}}\) holds with probability at least \(1 - u_{1}p^{-u_{2}}\). Thus, \(\mathcal {E}\) occurs with probability tending to one as \(p \to \infty \). For distributions with heavier tails, a larger choice of λ may be required (Yu et al. 2019).

B.4 De-biased Score Matching Estimator

The KKT conditions for the regularized score matching loss imply that the estimator \(\tilde {\boldsymbol {\alpha }}^{g}_{j}\) satisfies

$$ \begin{array}{@{}rcl@{}} \nabla L_{n,j}(\tilde{\boldsymbol{\alpha}}^{g}_{j}, \tilde{\boldsymbol{\theta}}^{g}_{j}) &=& \left( n^{g}\right)^{-1} \begin{pmatrix} \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \\ \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \end{pmatrix} \begin{pmatrix} \tilde{\boldsymbol{\alpha}}_{j}^{g} \\ \tilde{\boldsymbol{\theta}}_{j}^{g} \end{pmatrix}\\ &&+ \left( n^{g}\right)^{-1} \begin{pmatrix} \left( \mathcal{U}_{j,1}^{g}\right)^{\top}\mathbf{1} \\ \left( \mathcal{U}_{j,2}^{g}\right)^{\top}\mathbf{1} \end{pmatrix} = \begin{pmatrix} \lambda \nabla P\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} \right) \\ \mathbf{0} \end{pmatrix}. \end{array} $$

With some algebra, we can rewrite the KKT conditions as

$$ \begin{array}{@{}rcl@{}} &&\left( n^{g}\right)^{-1} \begin{pmatrix} \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \\ \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \end{pmatrix} \begin{pmatrix} \tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}_{j}^{g,*} \\ \tilde{\boldsymbol{\theta}}^{g}_{j} - \boldsymbol{\theta}_{j}^{g,*} \end{pmatrix} = \\ &&\lambda \begin{pmatrix} \nabla P\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} \right) \\ \mathbf{0} \end{pmatrix} - \left( n^{g}\right)^{-1} \begin{pmatrix} \left( \mathcal{V}_{j,1}^{g}\right)^{\top} \left( \mathcal{V}_{j,1}^{g} \boldsymbol{\alpha}_{j}^{g,*} + \mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*} \right) + \left( \mathcal{U}^{g}_{j,1}\right)^{\top} \mathbf{1} \\ \left( \mathcal{V}_{j,2}^{g}\right)^{\top} \left( \mathcal{V}_{j,1}^{g} \boldsymbol{\alpha}_{j}^{g,*} + \mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*} \right) + \left( \mathcal{U}^{g}_{j,2}\right)^{\top} \mathbf{1} \end{pmatrix}. \end{array} $$

Now, let Σj,n be the matrix

$$ {{\varSigma}}_{j,n} = \left( n^{g}\right)^{-1} \begin{pmatrix} \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \\ \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \end{pmatrix}, $$

let \({{\varSigma }}_{j}\) denote the population analogue of \({{\varSigma }}_{j,n}\), and let \(\tilde {M}_{j}\) be an estimate of \({{\varSigma }}_{j}^{-1}\). We can now rewrite the KKT conditions as

$$ \begin{array}{@{}rcl@{}} \begin{pmatrix} \tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}_{j}^{g,*} \\ \tilde{\boldsymbol{\theta}}^{g}_{j} - \boldsymbol{\theta}_{j}^{g,*} \end{pmatrix} &=& \underset{(\mathrm{i})}{\underbrace{\lambda \tilde{M}_{j} \begin{pmatrix} \nabla P\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} \right) \\ \mathbf{0} \end{pmatrix}}} - \underset{(\text{ii})}{\underbrace{\left( n^{g}\right)^{-1} \tilde{M}_{j} \begin{pmatrix} \left( \mathcal{V}_{j,1}^{g}\right)^{\top} \left( \mathcal{V}_{j,1}^{g} \boldsymbol{\alpha}_{j}^{g,*} + \mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*} \right) + \left( \mathcal{U}^{g}_{j,1}\right)^{\top} \mathbf{1} \\ \left( \mathcal{V}_{j,2}^{g}\right)^{\top} \left( \mathcal{V}_{j,1}^{g} \boldsymbol{\alpha}_{j}^{g,*} + \mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*} \right) + \left( \mathcal{U}^{g}_{j,2}\right)^{\top} \mathbf{1} \end{pmatrix} }} + \\ &&\quad\quad\quad \underset{(\text{iii})}{ \underbrace{\left( n^{g}\right)^{-1} \left\{ I - {{\varSigma}}_{j,n} \tilde{M}_{j} \right\} \begin{pmatrix} \tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}_{j}^{g,*} \\ \tilde{\boldsymbol{\theta}}^{g}_{j} - \boldsymbol{\theta}_{j}^{g,*} \end{pmatrix} }}. \end{array} $$
(B.1)

As is the case for the de-biased group LASSO in Appendix A, the first term (i) in Eq. B.1 depends only on the observed data and can be directly subtracted from the initial estimate. The second term (ii) is asymptotically equivalent to

$$ \left( n^{g}\right)^{-1}{{\varSigma}}_{j}^{-1} \begin{pmatrix} \left( \mathcal{V}_{j,1}^{g}\right)^{\top} \left( \mathcal{V}_{j,1}^{g} \boldsymbol{\alpha}_{j}^{g,*} + \mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*} \right) + \left( \mathcal{U}^{g}_{j,1}\right)^{\top} \mathbf{1} \\ \left( \mathcal{V}_{j,2}^{g}\right)^{\top} \left( \mathcal{V}_{j,1}^{g} \boldsymbol{\alpha}_{j}^{g,*} + \mathcal{V}_{j,2}^{g}\boldsymbol{\theta}_{j}^{g,*} \right) + \left( \mathcal{U}^{g}_{j,2}\right)^{\top} \mathbf{1} \end{pmatrix}, $$
(B.2)

if \(\tilde {M}_{j}\) is a consistent estimate of \({{\varSigma }}_{j}^{-1}\). Using the fact that the gradient of the population score matching loss vanishes at the true parameters \((\boldsymbol {\alpha }_{j}^{g,*}, \boldsymbol {\theta }_{j}^{g,*})\), it can be seen that Eq. B.2 is an average of i.i.d. random quantities with mean zero. The central limit theorem then implies that any low-dimensional sub-vector is asymptotically normal. The last term (iii) is asymptotically negligible if \(\tilde {M}_{j}\) is an approximate inverse of Σj,n and if \((\tilde {\boldsymbol {\alpha }}_{j}^{g}, \tilde {\boldsymbol {\theta }}_{j}^{g})\) is consistent for \((\boldsymbol {\alpha }_{j}^{g,*}, \boldsymbol {\theta }_{j}^{g,*})\). Thus, for an appropriate choice of \(\tilde {M}_{j}\), we expect asymptotic normality of an estimator of the form

$$ \begin{pmatrix} \check{\boldsymbol{\alpha}}^{g}_{j} \\ \check{\boldsymbol{\theta}}^{g}_{j} \end{pmatrix} = \begin{pmatrix} \tilde{\boldsymbol{\alpha}}^{g}_{j} \\ \tilde{\boldsymbol{\theta}}^{g}_{j} \end{pmatrix} - \lambda \tilde{M}_{j} \begin{pmatrix} \nabla P\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} \right) \\ \mathbf{0} \end{pmatrix}. $$

Before constructing \(\tilde {M}_{j}\), we first provide an alternative expression for \({{\varSigma }}_{j}^{-1}\). We define the d × d matrices \({{\varGamma }}^{*}_{j,k,l}\) and \({{\varDelta }}^{*}_{j,k}\) as

We also define the d × d matrices \({{\varLambda }}^{*}_{j,k}\) as

Additionally, we define the d × d matrices \(C^{*}_{j,k}\) and \(D^{*}_{j}\)

It can be shown that \({{\varSigma }}_{j}^{-1}\) can be expressed as

$$ {{\varSigma}}^{-1}_{j} = \begin{pmatrix} \left( C^{*}_{j,1}\right)^{-1} & {\cdots} & \mathbf{0} & \mathbf{0} \\ {\vdots} & {\ddots} & {\vdots} & \vdots \\ \mathbf{0} & {\cdots} & \left( C^{*}_{j,p}\right)^{-1} & \mathbf{0} \\ \mathbf{0} & {\cdots} & \mathbf{0} & \left( D^{*}_{j}\right)^{-1} \end{pmatrix} \begin{pmatrix} I & -{{\varGamma}}^{*}_{j,1,2} & {\cdots} & -{{\varGamma}}^{*}_{j,1,p} & - {{\varDelta}}^{*}_{j,1} \\ -{{\varGamma}}^{*}_{j,2,1} & I & {\cdots} & -{{\varGamma}}^{*}_{j,2,p} & - {{\varDelta}}^{*}_{j,2} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} & \vdots \\ -{{\varGamma}}^{*}_{j,p,1} & -{{\varGamma}}^{*}_{j,p,2} & {\cdots} & I & - {{\varDelta}}^{*}_{j,p} \\ -{{\varLambda}}^{*}_{j,1} & -{{\varLambda}}^{*}_{j,2} & {\cdots} & -{{\varLambda}}^{*}_{j,p} & I \end{pmatrix} . $$

We can thus estimate \({{\varSigma }}_{j}^{-1}\) by estimating each of the matrices \({{\varGamma }}^{*}_{j,k,l}\), \({{\varLambda }}^{*}_{j,k}\), and \({{\varDelta }}^{*}_{j,k}\).

Similar to our discussion of the de-biased group LASSO in Appendix A, we use a group-penalized variant of the nodewise LASSO to construct \(\tilde {M}_{j}\). We estimate \({{\varGamma }}^{*}_{j,k,l}\) and \({{\varDelta }}^{*}_{j,k}\) as

where ω1,ω2 > 0 are tuning parameters, and ∥⋅∥2,∗ is as defined in Appendix A. We estimate \({{\varLambda }}^{*}_{j,k}\) as

(B.3)

Additionally, we define the d × d matrices \(\tilde {C}_{j,k}\) and \(\tilde {D}_{j}\) as

$$ \begin{array}{@{}rcl@{}} &&\tilde{C}_{j,k} = \left( n^{g}\right)^{-1}\left( \mathcal{V}^{g}_{j,k,1}\right)^{\top} \left( \mathcal{V}_{j,k,1}^{g} - {\sum}_{l \neq k,j} \mathcal{V}_{j,l,1}^{g} \tilde{{{\varGamma}}}_{j,k,l} - \mathcal{V}^{g}_{j,2}\tilde{{{\varDelta}}}_{j,k} \right) \\ &&\tilde{D}_{j} = \left( n^{g}\right)^{-1}\left( \mathcal{V}^{g}_{j,2}\right)^{\top} \left( \mathcal{V}_{j,2}^{g} - {\sum}_{k \neq j} \mathcal{V}_{j,k,1}^{g} \tilde{{{\varLambda}}}_{j,k} \right). \end{array} $$

We then take \(\tilde {M}_{j}\) as

$$ \tilde{M}_{j} = \begin{pmatrix} \tilde{C}^{-1}_{j,1} & {\cdots} & \mathbf{0} & \mathbf{0} \\ {\vdots} & {\ddots} & {\vdots} & \vdots \\ \mathbf{0} & {\cdots} & \tilde{C}^{-1}_{j,p} & \mathbf{0} \\ \mathbf{0} & {\cdots} & \mathbf{0} & \tilde{D}^{-1}_{j} \end{pmatrix} \begin{pmatrix} I & -\tilde{{{\varGamma}}}_{j,1,2} & {\cdots} & -\tilde{{{\varGamma}}}_{j,1,p} & - \tilde{{{\varDelta}}}_{j,1} \\ -\tilde{{{\varGamma}}}_{j,2,1} & I & {\cdots} & -\tilde{{{\varGamma}}}_{j,2,p} & - \tilde{{{\varDelta}}}_{j,2} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} & \vdots \\ -\tilde{{{\varGamma}}}_{j,p,1} & -\tilde{{{\varGamma}}}_{j,p,2} & {\cdots} & I & - \tilde{{{\varDelta}}}_{j,p} \\ -\tilde{{{\varLambda}}}_{j,1} & -\tilde{{{\varLambda}}}_{j,2} & {\cdots} & -\tilde{{{\varLambda}}}_{j,p} & I \end{pmatrix} . $$

When \({{\varGamma }}^{*}_{j,k,l}\), \({{\varDelta }}^{*}_{j,k}\), and \({{\varLambda }}^{*}_{j,k}\) satisfy appropriate sparsity conditions and some additional regularity assumptions, \(\tilde {M}_{j}\) is a consistent estimate of \({{\varSigma }}_{j}^{-1}\) for \(\omega _{1} \asymp \{\log (p)/n\}^{1/2}\) and \(\omega _{2} \asymp \{\log (p)/n\}^{1/2}\) (see, e.g., Chapter 8 of Bühlmann and van de Geer (2011) for a more comprehensive discussion). Using the same argument presented in Appendix A, we are able to obtain the following bound on a scaled version of the remainder term (iii):

$$ \begin{array}{@{}rcl@{}} &&\!\!\!\!\!\left\| \begin{pmatrix} \tilde{C}_{j,1} & {\cdots} & \mathbf{0} & \mathbf{0} \\ {\vdots} & {\ddots} & {\vdots} & \vdots \\ \mathbf{0} & {\cdots} & \tilde{C}_{j,p} & \mathbf{0} \\ \mathbf{0} & {\cdots} & \mathbf{0} & \tilde{D}_{j} \end{pmatrix} \left\{ I - \left( n^{g}\right)^{-1}\!\! \begin{pmatrix} \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,1}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \\ \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,1}^{g} & \left( \mathcal{V}_{j,2}^{g}\right)^{\top}\mathcal{V}_{j,2}^{g} \end{pmatrix} \tilde{M}_{j} \right\} \begin{pmatrix} \tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}_{j}^{g,*} \\ \tilde{\boldsymbol{\theta}}^{g}_{j} - \boldsymbol{\theta}_{j}^{g,*} \end{pmatrix} \right\|_{\infty} \leq \\ &&\max\{\omega_{1}, \omega_{2} \} \left\{ \mathcal{P}\left( \tilde{\boldsymbol{\alpha}}^{g}_{j} - \boldsymbol{\alpha}^{g,*}_{j} \right) + \| \tilde{\boldsymbol{\theta}}^{g}_{j} - \boldsymbol{\theta}_{j}^{g,*} \|_{2} \right\}. \end{array} $$

The remainder is oP(n− 1/2) and hence asymptotically negligible if \(n^{1/2}\max \limits \{\omega _{1}, \omega _{2}\} \lambda \to 0\), where λ is the tuning parameter for the regularized score matching estimator (see Theorem 2).

The de-biased estimate \(\check {\alpha }^{g}_{j,k}\) of \(\alpha ^{g,*}_{j,k}\) can be expressed as

$$ \begin{array}{@{}rcl@{}} \check{\alpha}^{g}_{j,k} &=& \tilde{\alpha}^{g}_{j,k} - \left( n^{g}\right)^{-1} \tilde{C}^{-1}_{j,k} \left( \mathcal{V}^{g}_{j,k,1} - {\sum}_{l \neq j, k} \mathcal{V}_{j,l,1}^{g} \tilde{{{\varGamma}}}_{j,k,l} \right)^{\top} \\ &&\left( \mathcal{V}^{g}_{j,1} \tilde{\boldsymbol{\alpha}}^{g}_{j} + \mathcal{V}^{g}_{j,2} \tilde{\boldsymbol{\theta}}_{j}^{g} + \left( \mathcal{U}_{j,1}^{g}\right)^{\top} \mathbf{1} \right). \end{array} $$
(B.4)

The difference between the de-biased estimator \(\check {\alpha }^{g}_{j,k}\) and the true parameter \(\alpha ^{g,*}_{j,k}\) can be expressed as

$$ \begin{array}{@{}rcl@{}} \tilde{C}_{j,k}\left( \check{\alpha}^{g}_{j,k} - \alpha^{g,*}_{j,k}\right) &=&\!\!\!\! -\left( n^{g}\right)^{-1} \left( \mathcal{V}^{g}_{j,k,1} - {\sum}_{l \neq j, k} \mathcal{V}_{j,l,1}^{g} {{\varGamma}}^{*}_{j,k,l} \right)^{\top} \left( \mathcal{V}^{g}_{j,1} \boldsymbol{\alpha}^{g,*}_{j} + \mathcal{V}^{g}_{j,2} \boldsymbol{\theta}_{j}^{g,*} + \left( \mathcal{U}_{j,1}^{g}\right)^{\top} \mathbf{1} \right) + \\ &&\!\!\!\!\left( n^{g}\right)^{-1} \left( \mathcal{V}^{g}_{j,2} {{\varDelta}}^{*}_{j,k}\right)^{\top} \left( \mathcal{V}^{g}_{j,1} \boldsymbol{\alpha}^{g,*}_{j} + \mathcal{V}^{g}_{j,2} \boldsymbol{\theta}_{j}^{g,*} + \left( \mathcal{U}_{j,2}^{g}\right)^{\top} \mathbf{1} \right) + o_{P}\left( n^{-1/2}\right). \end{array} $$

As discussed above, the central limit theorem implies asymptotic normality of \(\check {\alpha }^{g}_{j,k}\). We can estimate the asymptotic variance of \(\check {\alpha }^{g}_{j,k}\) as

$$ \left( n^{g}\right)^{-2}\tilde{C}_{j,k}^{-1}\tilde{M}_{j,k}\tilde{\xi}^{\top}\tilde{\xi}\tilde{M}^{\top}_{j,k} \left( \tilde{C}_{j,k}^{-1}\right)^{\top}, $$

where we define

$$ \begin{array}{@{}rcl@{}} \tilde{\xi} &=& \begin{pmatrix} \text{diag}\left( \mathcal{V}_{j,1}^{g} \tilde{\boldsymbol{\alpha}}_{j}^{g} + \mathcal{V}_{j,2}^{g}\tilde{\boldsymbol{\theta}}_{j}^{g} \right)\mathcal{V}_{j,1}^{g} + \mathcal{U}^{g}_{j,1} \\ \text{diag}\left( \mathcal{V}_{j,1}^{g} \tilde{\boldsymbol{\alpha}}_{j}^{g} + \mathcal{V}_{j,2}^{g}\tilde{\boldsymbol{\theta}}_{j}^{g} \right)\mathcal{V}_{j,2}^{g} + \mathcal{U}^{g}_{j,2} \end{pmatrix} \\ \tilde{M}_{j,k} &=& \begin{pmatrix} -\tilde{{{\varGamma}}}_{j,k,1} & {\cdots} & -\tilde{{{\varGamma}}}_{j,k,k-1} & I & -\tilde{{{\varGamma}}}_{j,k,k+1} & {\cdots} & -\tilde{{{\varGamma}}}_{j,k,p} & - \tilde{{{\varDelta}}}_{j,p} \end{pmatrix}. \end{array} $$

Cite this article

Hudson, A., Shojaie, A. Covariate-Adjusted Inference for Differential Analysis of High-Dimensional Networks. Sankhya A 84, 345–388 (2022). https://doi.org/10.1007/s13171-021-00252-5

Keywords

  • Differential network
  • Confounding
  • High-dimensional
  • Penalized likelihood
  • De-biased LASSO
  • Exponential family

PACS Nos

  • 62H22 (primary); 62J07 (secondary)