Introduction

In many scientific disciplines, researchers are interested in the linear dependencies and unique relations among large sets of variables, such as genes1, proteins2, symptoms of a disease3, functional brain connectivity4, etc. There is consensus that computing all pairwise correlations between these variables is misleading, because such correlations do not correct for linear relations that might be due to other variables. Therefore, many researchers resort to calculating partial correlation coefficients, which express the remaining linear dependency between two variables after the effect of the rest of the variables under study is removed. More specifically, Gaussian Graphical Models (GGMs) have become increasingly popular5,6. These models yield an undirected network (i.e., undirected graph) in which the variables are depicted as nodes and the partial correlations among the variables are visualized as the edges among the nodes. The width of an edge reflects the size of the corresponding partial correlation (see Fig. 1).

Figure 1

The undirected network implied by the toy example.

Often, a sparse GGM is fitted, which implies that many of the partial correlations are forced to zero and thus that the corresponding edges in the network can be dropped. In some applications, the assumption of sparsity is intrinsic to the phenomenon under study. For instance, it has been shown that most genetic networks are sparse7,8. In other applications, the assumption of sparsity is motivated through improved interpretability. Indeed, even if the true model is not sparse, the sparsity assumption allows one to estimate the remaining parameters more accurately when the amount of information per parameter (n/p) is relatively small9, and it prevents overfitting10.

Popular methods to estimate sparse GGMs are the regularized nodewise regression approach of Meinshausen and Bühlmann11, the joint sparse regression (SPACE) approach by Peng et al.12 and the Graphical lasso (Glasso) proposed by Friedman, Hastie and Tibshirani13. These three approaches optimize different objective functions (see Methods section) but all set some of the estimated parameters, and thus some of the network edges, to zero through \({\ell }_{1}\) penalization. This penalization boils down to summing the absolute values of the estimated parameters and adding this sum to the objective function, after multiplying it by a regularization parameter. This parameter determines the impact of the penalty and has to be tuned by the user. Different tuning approaches have been proposed, based on cross-validation, information criteria, or finite sample derivations. Yet, \({\ell }_{1}\) penalization often does not work well. Indeed, recent studies on the use of \({\ell }_{1}\) penalization in standard regression analysis have shown that it tends to yield too many non-zero regression weights14,15,16. Translating these results to the estimation of sparse GGMs, we expect regularized nodewise regression, SPACE and the Glasso to often yield false positives, implying that some of the drawn edges should have been dropped. We will test this hypothesis in extensive simulations, in which we will also evaluate the effect of the tuning approach (i.e., information criteria, k-fold cross-validation or finite sample derivations).

To overcome the problem of incorrectly included edges, we will present a novel approach that we call Partial Correlation Screening (PCS). Our PCS approach consists of two steps. In the first step, we estimate a sparse partial correlation network using one of the state-of-the-art methods mentioned above. In the second step, we try to filter out the false positives that will probably be present in the estimated network. To this end, we screen the resulting partial correlation matrix for values that are smaller in absolute value than a cross-validation-based threshold and set these to zero. This novel approach is based on earlier work on thresholding after regularization. Specifically, Saligrama et al.17 and Descloux and Sardy18 proposed the idea of thresholding after applying an \({\ell }_{1}\) regularized procedure in the context of regression analysis. Ha and Sun19 presented a related idea for GGMs that consists of estimating the partial correlation matrix using a ridge penalty and then determining the non-zero entries of the matrix by hypothesis testing. Therefore, we will also evaluate what happens if we replace the \({\ell }_{1}\) penalty by a ridge penalty. We will apply the Partial Correlation Screening approach to the same simulated data to show that it indeed performs better. Finally, we will show how the PCS approach can be used to estimate networks based on real datasets: (1) a gene regulatory network of patients with breast cancer, (2) a symptom network of patients with a diagnosis within the nonaffective psychotic spectrum and (3) a symptom network of patients with Post-Traumatic Stress Disorder (PTSD).

The rest of the article is organized as follows. In the next section, we first present a toy example to introduce some notation and concepts and to illustrate that state-of-the-art estimation approaches yield networks that differ from the population model. Then, using this toy example, we show how our PCS procedure works. Next, we discuss the results of two simulation studies, one based on settings that have been used in other papers on this topic and one based on the estimated network for a real data set. Subsequently, we present applications to three real datasets. Next, we discuss our findings and formulate conclusions. Finally, the Methods section presents a detailed description of the evaluated tuning approaches for each of the state-of-the-art estimation approaches and of the PCS procedure.

Results

Toy example

The toy data consists of n = 100 observations that are sampled from a p = 6-dimensional multivariate Gaussian distribution. We set the covariance matrix of the distribution Σ to:

$$\Sigma =\left[\begin{array}{cccccc}1.63 & 0.00 & -0.70 & 0.00 & 0.63 & -0.70\\ 0.00 & 1.00 & 0.00 & 0.00 & 0.00 & 0.00\\ -0.70 & 0.00 & 1.68 & 0.00 & -0.70 & -0.13\\ 0.00 & 0.00 & 0.00 & 1.00 & 0.00 & 0.00\\ 0.63 & 0.00 & -0.70 & 0.00 & 1.63 & -0.70\\ -0.70 & 0.00 & -0.13 & 0.00 & -0.70 & 1.68\end{array}\right]$$
(1)

The conditional independence structure of this distribution can be represented by a GGM. The corresponding undirected network is shown in Fig. 1. The six variables X1 to X6 form the set of nodes V = {1, 2, 3, 4, 5, 6}. The set of edges E contains all node pairs (i, j) that are connected in the network, implying that Xi is conditionally dependent on Xj, given all the remaining variables. Thus, variable pairs that do not belong to the edge set are conditionally independent, given all remaining variables. For instance, in this illustration, the network shows an edge between variables X3 and X6. Therefore, these variables are conditionally dependent. However, there is no edge between variables X1 and X2, implying that X1 and X2 are conditionally independent.

Because the variables are Gaussian distributed, a variable pair (i, j) is conditionally independent if and only if their partial correlation given the rest of the variables is zero5. Let us denote the partial correlation matrix by Γ. The entries ρij|V\{i, j} of this matrix are the partial correlations between variables Xi and Xj, conditioned on the rest of the variables. For the toy example the matrix Γ equals:

$$\Gamma =\left[\begin{array}{cccccc}1.00 & 0.00 & -0.45 & 0.00 & 0.00 & -0.45\\ 0.00 & 1.00 & 0.00 & 0.00 & 0.00 & 0.00\\ -0.45 & 0.00 & 1.00 & 0.00 & -0.45 & -0.45\\ 0.00 & 0.00 & 0.00 & 1.00 & 0.00 & 0.00\\ 0.00 & 0.00 & -0.45 & 0.00 & 1.00 & -0.45\\ -0.45 & 0.00 & -0.45 & 0.00 & -0.45 & 1.00\end{array}\right]$$
(2)
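To make the example concrete, the following R sketch (the variable names are chosen here for exposition and are not taken from the paper's supplementary script) reconstructs Γ from Σ via the standard relation ρij|V\{i,j} = −ωij/√(ωii ωjj), where Ω = Σ−1, and draws the n = 100 toy observations.

```r
library(MASS)   # for mvrnorm

# Covariance matrix of Eq. (1)
Sigma <- matrix(c( 1.63, 0.00, -0.70, 0.00,  0.63, -0.70,
                   0.00, 1.00,  0.00, 0.00,  0.00,  0.00,
                  -0.70, 0.00,  1.68, 0.00, -0.70, -0.13,
                   0.00, 0.00,  0.00, 1.00,  0.00,  0.00,
                   0.63, 0.00, -0.70, 0.00,  1.63, -0.70,
                  -0.70, 0.00, -0.13, 0.00, -0.70,  1.68), nrow = 6)

# Partial correlations: rho_ij = -omega_ij / sqrt(omega_ii * omega_jj)
Omega <- solve(Sigma)           # precision matrix
Gamma <- -cov2cor(Omega)        # rescale by the diagonal and flip the sign
diag(Gamma) <- 1
round(Gamma, 2)                 # should reproduce Eq. (2) up to rounding

# Draw the n = 100 toy observations
set.seed(123)
X <- mvrnorm(n = 100, mu = rep(0, 6), Sigma = Sigma)
```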

We can now define the neighborhood of each node. The neighborhood of node i consists of all the nodes j that form an edge with node i, implying that the partial correlation of Xi and Xj differs from zero. In the toy example the neighborhood of node 1 is formed by nodes 3 and 6, while the neighborhood of node 2 is empty.

Since the true edge set of the toy example is sparse, we can estimate it by means of the Glasso, SPACE, \({\ell }_{1}\) regularized nodewise regression (NR) and ridge nodewise regression (Ridge). Unlike Glasso and SPACE, which directly estimate the edge structure, NR fits a separate regression model for each node and thus yields two regression weights for each edge. To combine the information in these two weights into one edge, we can consider two variants, NR-AND and NR-OR. The AND rule means that an edge is only included in the model if both regression weights differ from zero, whereas the OR rule is more liberal and selects all edges for which at least one of the regression weights is not set to zero. Ridge estimates the partial correlations by fitting a regression model for each node using an \({\ell }_{2}\) penalty, which shrinks the regression weights towards zero.

For each of the estimation methods, a number of approaches have been put forward to tune the regularization parameter, the details of which are provided in the Methods section. For Glasso we will use 10-fold cross-validation with two different loss functions: the first approach aims to minimize the negative log-likelihood function (CV1) and the second approach focuses on the sum of the prediction errors of each node (CV2). Moreover, we will apply two selection rules when using cross-validation: selecting the model that yields the lowest loss value and applying the one-standard-error rule (1se)20. Additionally, we will consider the Bayesian Information Criterion (BIC) and the Extended Bayesian Information Criterion (EBIC)21. To tune the weight of the \({\ell }_{1}\) penalty term in SPACE and NR, we will apply 10-fold CV, its one-standard-error-rule variant, BIC and the finite sample result (FSR) proposed by Meinshausen and Bühlmann11. Note that in NR the tuning is performed for each separate regression. To optimize the weight of the \({\ell }_{2}\) penalty term in Ridge we will apply 10-fold CV for each separate regression. We note that the considered set of procedures is not intended to be exhaustive. Yet, the set is sufficient to illustrate the problem of efficiently tuning the penalty weight when there is limited information.

Figure 2 shows the GGMs obtained with the nineteen considered approaches (i.e., nineteen combinations of an estimation method (Glasso, SPACE, NR-AND, NR-OR and Ridge) and a tuning option (CV, CV-1se, BIC, EBIC, FSR)). We observe that Glasso-CV1-1se (panel c), NR-AND-FSR (panel o) and NR-OR-FSR (panel s) yield a network that is sparser than the true network. With Glasso-CV1-1se, all edges are set to zero. Whereas with NR-AND-FSR the edges (1, 3) and (3, 6) are set to zero, with NR-OR-FSR only the edge (3, 6) is set to zero. The other estimation methods yield networks that contain the true set of edges as well as false positives, with the number of false positives varying from ten (Glasso-CV2, panel d; NR-AND-CV-1se, panel m; Ridge-CV, panel t) to one (SPACE-CV-1se, panel i).

Figure 2

Estimated undirected networks for the toy example, before applying PCS.

Our PCS procedure aims to remove these false positive edges. The first step of the procedure is to apply one of the nineteen considered approaches. In the second step, we try to single out the false positives by thresholding the entries of the estimated partial correlation matrix. Specifically, only the partial correlations that are larger in absolute value than a given threshold are retained, whereas the others are set to zero and thus removed from the network. The threshold is calibrated by means of 10-fold cross-validation (see Methods section for more information). For the toy example the nineteen computed thresholds range from 0.0001 to 0.283. Figure 3 presents the networks that we obtain by applying these thresholds to the networks in Fig. 2. We observe that PCS-Glasso-CV1 (panel b), PCS-Glasso-CV2 (panel d), PCS-Glasso-CV2-1se (panel e), PCS-Glasso-BIC (panel f), PCS-Glasso-EBIC (panel g), PCS-SPACE-CV (panel h), PCS-SPACE-BIC (panel j), PCS-SPACE-FSR (panel k), PCS-NR-AND-CV (panel l), PCS-NR-AND-CV-1se (panel m), PCS-NR-AND-BIC (panel n), PCS-NR-OR-CV-1se (panel q), PCS-NR-OR-BIC (panel r) and PCS-Ridge-CV (panel t) remove the false positives and yield the true network. For PCS-SPACE-CV-1se (panel i), none of the false positives are removed. PCS-NR-OR-CV (panel p) discards all but one false positive edge. Obviously, the networks with false negatives (panels c, o and s) cannot be improved by PCS.

Figure 3

Estimated undirected networks for the toy example, after applying PCS.

Simulation study with synthetic data

In this section we perform an extensive simulation study to evaluate and compare the performance of the different procedures. We will inspect the results obtained with the nineteen combinations used above and study whether they improve when adding PCS. To this end, we replicated the settings used by Liu et al.22, Ravikumar et al.23, Rothman et al.24 and Yuan and Lin25.

Design

Each simulated data set is generated by drawing n independent observations from a p-variate Gaussian distribution with mean zero and partial correlation matrix Γ. We considered two possible sample sizes n = {100, 500} and three different values of p = {20, 60, 200}. We inspected four different specifications of the population partial correlation matrix Γ. To illustrate these specifications for p = 60, we visualized them in Fig. 4.

  1. Model 1: 2-neighbor chain graph, in which ρii|V\{i} = 1, ρi, i+1|V\{i, i+1} = ρi−1, i|V\{i, i−1} = −0.4, and all other partial correlations equal 0 (a short R sketch of this construction is given after the list).

  2. Model 2: 3-neighbor chain graph, in which ρii|V\{i} = 1, ρi, i+1|V\{i, i+1} = ρi−1, i|V\{i, i−1} = −0.4, ρi, i+2|V\{i, i+2} = ρi−2, i|V\{i, i−2} = −0.2, and all other partial correlations equal 0.

  3. Model 3: 2 nearest-neighbor graph. We first specify the inverse of the covariance matrix Σ as follows: we randomly select p points from a unit square and compute all pairwise distances between them. Then, the neighborhood set of each node is formed by the two nodes at the smallest distance. Next, the OR-rule is applied to these neighborhood sets to derive the associated undirected network. The off-diagonal elements of the corresponding Σ−1 are randomly chosen from the interval \([-1,-0.5]\cup [0.5,1]\). To ensure that Σ−1 is positive definite, the matrix is transformed as Σ−1 + (|λmin(Σ−1)| + 0.1)Ip, where λmin(Σ−1) refers to the smallest eigenvalue of Σ−1 and Ip is an identity matrix of dimension p. To compute Γ, we normalize Σ−1 and multiply the off-diagonal elements by (−1).

  4. Model 4: Random graph. We first specify Σ−1 as follows: each upper triangular element of Σ−1 is set equal to 0.3 with probability ρ and to zero otherwise. We set the probability ρ = {0.1, 0.01, 0.001} when p = {20, 60, 200}, respectively. Next, we set the lower triangular elements equal to the corresponding upper triangular elements. To ensure that Σ−1 is positive definite, the matrix is transformed as in Model 3. Finally, to compute Γ we normalize Σ−1 and multiply the off-diagonal elements by (−1).
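As a concrete illustration of Models 1 and 2, the following R sketch (a simplified construction of our own, not the supplementary script) builds a chain-graph partial correlation matrix, converts it to a covariance matrix under the convenient scaling that sets the diagonal of Σ−1 to one, and draws a data set.

```r
# Chain-graph partial correlation matrix: lag-k neighbors get the k-th value in 'rho'.
chain_gamma <- function(p, rho = c(-0.4)) {
  Gamma <- diag(p)
  for (k in seq_along(rho)) {
    for (i in 1:(p - k)) Gamma[i, i + k] <- Gamma[i + k, i] <- rho[k]
  }
  Gamma
}

# Under the scaling diag(Omega) = 1, Omega_ij = -rho_ij, so Sigma = Omega^{-1}.
gamma_to_sigma <- function(Gamma) {
  Omega <- -Gamma
  diag(Omega) <- 1
  solve(Omega)
}

p <- 60; n <- 100
Gamma1 <- chain_gamma(p, rho = c(-0.4))          # Model 1
Gamma2 <- chain_gamma(p, rho = c(-0.4, -0.2))    # Model 2
X <- MASS::mvrnorm(n, mu = rep(0, p), Sigma = gamma_to_sigma(Gamma1))
```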

Figure 4

Heatmaps of the true simulated networks when p = 60. White represents partial correlations equal to zero, and black represents partial correlations different from zero.

We generated 100 replicates for each cell of the design. An R script to conduct the simulation experiment is provided in the Supplementary Information.

Performance measures

To evaluate how well the different methods perform in distinguishing between true non-zero partial correlations and true zero ones, we compute the True Positive Rate (TPR) and False Positive Rate (FPR):

$${\rm{TPR}}=\frac{{\rm{TP}}}{{\rm{TP}}+{\rm{FN}}}$$
(3)
$${\rm{FPR}}=\frac{{\rm{FP}}}{{\rm{TN}}+{\rm{FP}}}$$
(4)

where TP is the number of true positives (true non-zero edges that are estimated as such), TN is the number of true negatives (true zero edges that are recognized as such), FP is the number of false positives (true zero edges that are estimated as non-zero) and FN is the number of false negatives (true non-zero edges that are estimated as zero). The TPR and FPR coefficients take values in the range [0, 1]. For the TPR, a value of 0 indicates that the labeling of edges as non-zero is completely wrong, a value of 0.5 indicates that the procedure cannot do better than random prediction and a value of 1 indicates a perfect recovery of the non-zero edges. Similarly, an FPR value of 0 indicates a perfect recovery of the zero edges, a value of 0.5 indicates that the procedure cannot do better than random prediction and a value of 1 indicates that the labeling of edges as zero is completely wrong. We also report the average number of TP and FP values across the 100 replicates.
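In code, with Γ̂ the estimated and Γ the true partial correlation matrix, these rates can be computed as in the following short R sketch (our own helper name):

```r
# TPR (Eq. 3) and FPR (Eq. 4) over the off-diagonal (upper-triangular) entries.
edge_rates <- function(Gamma_hat, Gamma_true) {
  est   <- abs(Gamma_hat[upper.tri(Gamma_hat)])   > 0
  truth <- abs(Gamma_true[upper.tri(Gamma_true)]) > 0
  c(TPR = sum(est & truth) / sum(truth),
    FPR = sum(est & !truth) / sum(!truth))
}
```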

Results

Tables 1 to 3 show the average TPR and FPR scores for the different methods under consideration for the different choices of p. We also report the average number of TP and FP for the different methods in Tables 4 to 6. First, we compare the performance of the methods without conducting PCS. The TPR and FPR scores depend strongly on the model used to generate the data and on the values of n and p (i.e., the amount of available information). In general, when comparing how well the different methods control the number of false non-zero partial correlations, we observe that for every combination of p and n, SPACE and NR perform better than Glasso. The results for Glasso are affected by the penalty tuning approach: whereas using cross-validation tends to introduce a large number of false positives across all conditions, applying EBIC yields many false negatives. For NR and SPACE, the results depend on n, p and the data generating model. For n = 100, the best overall, though still rather poor, performance is obtained with some of the NR-AND variants for Models 1 and 4 and with some of the SPACE variants for Models 2 and 3. We also note that for Model 2 none of the state-of-the-art methods (excluding Ridge) is able to accurately estimate the true number of non-zero edges. Furthermore, in the high-dimensional case (i.e., p > n) all approaches perform badly in controlling the number of false positive edges. When n = 500, the TPR and FPR values are clearly better than for the low-sample-size setting and indicate good overall performance.

Table 1 Average true positive rate (TPR) and false positive rate (FPR) over 100 replications when p = 20.
Table 2 Average true positive rate (TPR) and false positive rate (FPR) over 100 replications when p = 60.
Table 3 Average true positive rate (TPR) and false positive rate (FPR) over 100 replications when p = 200.
Table 4 Average number of true positive edges (TP) and false positive edges (FP) over 100 replications when p = 20. For each model the number of non-zero partial correlations are: 19 for Model 1, 37 for Model 2, 16 for Model 3 and 12 for Model 4.
Table 5 Average number of true positive edges (TP) and false positive edges (FP) over 100 replications when p = 60. For each model the number of non-zero partial correlations are: 59 for Model 1, 117 for Model 2, 48 for Model 3 and 13 for Model 4.
Table 6 Average number of true positive edges (TP) and false positive edges (FP) over 100 replications when p = 200. For each model the number of non-zero partial correlations are: 199 for Model 1, 397 for Model 2, 142 for Model 3 and 23 for Model 4.

Turning to the results after applying PCS, we observe in Tables 1 to 6 that PCS-SPACE and PCS-NR estimate networks that contain a smaller number of false positive edges than the state-of-the-art methods without PCS. This improvement is larger for Models 1, 3 and 4 and when n = 100 and n < p, in that PCS is able to control the number of false positive edges without compromising the number of correctly estimated true edges. Furthermore, the performance differences among the SPACE and NR variants diminish. PCS-SPACE-BIC has the best overall performance across all the n = 100 conditions. The Glasso-EBIC results cannot be improved by PCS, because Glasso-EBIC yields networks with a large number of false negatives. When n = 500, PCS performs almost perfectly in finding the non-zero edges in Models 1, 3 and 4, while for Model 2 the best overall performance is obtained with PCS-NR-OR-BIC when p = 20, 60 and with PCS-NR-OR-FSR when p = 200.

Finally, we study how the sample size and the non-sparsity level influence the magnitude of the estimated threshold in the PCS procedure. For Glasso, SPACE, NR using the AND rule and NR using the OR rule, we estimate a linear mixed effect model with a random intercept in which observations are clustered according to the tuning procedure (i.e., different CV variants, information criteria or finite sample results). The model includes the estimated thresholds as the dependent variable and the sample size and the non-sparsity level as predictors. The non-sparsity level is computed as the number of true non-zero partial correlations divided by the total number of possible edges in the network. For Ridge we estimate the same model using OLS regression. Table 7 shows the obtained regression coefficients for each estimation procedure. We observe that across the different estimation procedures there is a significant negative relation between the sample size and the estimated threshold value. We also find a significant negative relation between the non-sparsity level and the threshold value for all methods except Glasso.

Table 7 Regression coefficients, standard errors (SE), associated Wald’s t-scores and p-values for all predictors in the analysis.

Simulation study based on real data

In this section we simulate data based on the sparse GGM results obtained by Armour et al.3 for 20 Post-Traumatic Stress Disorder (PTSD) symptoms of 221 U.S. military veterans. The 20 PTSD symptoms are assumed to form four symptom clusters: intrusions (B1-B5), avoidance (C1-C2), negative alterations in cognitions and mood (D1-D7), and alterations in arousal and reactivity (E1-E6). Armour et al.3 applied the Glasso-EBIC approach and used bootstrapping techniques to estimate the parameter accuracy and stability of the partial correlation matrix Γ26. The associated network, shown in Fig. 5, reveals strong positive within-cluster connections between nightmares (B2) and flashbacks (B3), blame of self or others (D3) and negative trauma related emotions (D4), detachment (D6) and restricted affect (D7), and hypervigilance (E3) and exaggerated startle response (E4). In addition, they find many moderately positive connections within the symptom clusters, for instance between intrusive thoughts (B1) and nightmares (B2), between avoidance of thoughts (C1) and avoidance of reminders (C2), and between irritability/anger (E1) and self-destructive behaviour (E2), but also between symptom clusters, for instance between loss of interest (D5) and difficulty in concentrating (E5).

Figure 5

Heatmap of the true network based on the data on 20 PTSD symptoms. White represents partial correlations equal to zero, and black represents partial correlations different from zero.

To compare the performance before and after using PCS, we drew n observations from a 20-variate Gaussian distribution with mean zero and partial correlation matrix Γ. We used two sample sizes n = {100, 500} and replicated the simulation 100 times.

Table 8 shows the average TPR and FPR scores and Figs. 6 and 7 present heatmaps of the frequency with which the entries of the partial correlation matrix are detected as non-zero. We observe that Partial Correlation Screening (PCS) significantly outperforms Glasso, NR and SPACE. When n = 100, PCS-SPACE-BIC has the best performance in terms of the false positive rate, which is in line with the simulation results on synthetic data. For n = 500, all the estimation procedures using PCS show an average TPR higher than 0.999 and an average FPR below 0.020 (see Table 8).

Table 8 Average true positive rate (TPR) and false positive rate (FPR) over 100 simulations based on the PTSD data.
Figure 6

Heatmaps of the frequency with which the edges in the PTSD-data-based simulations (n = 100) are estimated as non-zero by the different methods before applying PCS. White indicates that an edge was excluded from the network in all replications, whereas black reflects that the edge was always retained in the network.

Figure 7

Heatmaps of the frequency with which the edges in the PTSD-data-based simulations (n = 100) are estimated as non-zero by the different methods after applying PCS. White indicates that an edge was excluded from the network in all replications, whereas black reflects that the edge was always retained in the network.

Breast cancer data

GGMs have been widely applied to analyze gene expression data, since many authors hypothesize that the complex interactions between genes take the form of sparse pathways or networks27,28,29. More specifically, given mRNA levels of different patients, researchers have studied the conditional dependencies of genes for a variety of diseases1.

We estimate a sparse partial correlation network for gene expression data from a breast cancer study by West et al.30. The dataset contains measurements of 7,129 genes for 49 breast tumor tissue samples: 25 samples from patients diagnosed as estrogen receptor positive and 24 samples from patients diagnosed as estrogen receptor negative. In line with Sheridan et al.31, we focus on a subset of p = 150 genes related to the estrogen receptor gene ESR1. This gene acts as an estrogen-activated transcription factor and plays a key role in the proliferation of cancerous cells32.

Table 9 shows how many edges are obtained with the Glasso, NR, SPACE and Ridge techniques under consideration and how much these numbers of edges decrease by applying PCS. It can be concluded that the sparsity level varies considerably depending on the approach used. We observe that when the procedures yield dense networks (i.e. Ridge-CV, Glasso-CV1, Glasso-CV1-1se, Glasso-CV2, Glasso-CV2-1se and NR-OR-CV), applying PCS produces a larger reduction in the number of edges.

Table 9 Estimated number of edges of the gene regulatory network for the breast cancer data, the symptom network of patients with a diagnosis within the nonaffective psychotic spectrum using the BPRS scale and the symptom network of patients with PTSD.

Given that the results vary considerably across the methods, the next question is how we should deal with this uncertainty when interpreting the networks. We opt to combine the results of the different estimation methods33,34, by computing a network that includes all edges that occur in at least two of the nineteen obtained PCS networks. Note that if we apply this combination approach to the estimated PCS networks for the toy example (see Fig. 3), we would recover the true network.
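A minimal sketch of this combination rule, assuming pcs_networks is a list holding the nineteen binary adjacency matrices obtained after PCS:

```r
# Keep every edge that appears in at least 'min_count' of the PCS networks.
combine_networks <- function(pcs_networks, min_count = 2) {
  counts <- Reduce(`+`, pcs_networks)   # edge-wise frequency across networks
  (counts >= min_count) * 1
}
```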

Figure 8 shows the resulting combined network for the breast cancer data. Figure 9 focuses on the sub-network of the genes that are related with the estrogen receptor gene ESR1 (Panel a) and the gene FOXA1 (Panel b). We can identify some important regulatory interactions in the estimated GGM. As a first example, the ESR1 (ESR) gene is partially correlated with SLC39A6 (SLC). The latter gene functions as a zinc transporter, has been shown to be highly expressed in ESR1-positive tumours, and is highly significantly associated with the spread of breast cancer to the lymph nodes32. As a second example, we can inspect the genes that belong to the neighborhood of FOXA1 (FOX). FOXA1 has been found to be predominantly expressed in luminal type A carcinomas35 and may prevent metastatic progression of this type of breast cancer36. We observe an edge between FOXA1 (FOX) and AR (androgen receptor), which is in line with findings that indicate that AR regulates estrogen receptor expression37.

Figure 8

Estimated gene regulatory network for the breast cancer data.

Figure 9

Estimated sub-network of genes in the neighborhood of ESR1 and FOXA1 for the breast cancer data.

Psychopathological symptoms data

For a long time, modeling approaches to psychopathological data started from the assumption that psychopathological symptoms reflect an underlying mental disorder and thus are caused by this disorder38. This assumption has recently been challenged and an alternative hypothesis has been put forward stating that symptoms are causally active components of a mental disorder39,40. Within this framework, network analysis is then used to study the conditional dependencies between a set of symptoms41,42.

We studied the conditional dependencies of a set of 24 psychopathological symptoms in a sample of 184 patients (189 before patients with missing data were discarded) with a diagnosis within the nonaffective psychotic spectrum, who participated in the second wave of the multicenter Genetic Risk and Outcome of Psychosis (GROUP) cohort study43. The symptoms are measured using the Brief Psychiatric Rating Scale (BPRS)44, which captures the following symptoms: Somatic Concern (SmC), Anxiety (Anx), Depression (Dpr), Guilt (Glt), Hostility (Hst), Suspiciousness (Ssp), Unusual Thought (UnT), Grandiosity (Grn), Hallucinations (Hll), Disorientation (Dsr), Conceptual Disorganization (CnD), Excitement (Exc), Elevated mood (ElM), Tension (Tns), Mannerisms (Mnn), Uncooperativeness (Unc), Motor Retardation (MtR), Suicidality (Scd), Self Neglect (SlN), Bizarre Behaviour (BzB), Motor Hyperactivity (MtH), Distractibility (Dst), Emotional Withdrawal (EmW) and Blunted Affect (BlA). Each symptom is rated on a 7-point Likert scale. Because the data are measured on a Likert scale rather than a continuous one, we apply the nonparanormal transformation proposed by Liu et al.45, which uses the Gaussian copula to transform the data into normal scores.
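One readily available implementation of this transformation is huge.npn in the R package huge; the call below is an illustrative sketch in which bprs stands for the n × 24 matrix of BPRS item scores (a placeholder name, not an object from our scripts).

```r
library(huge)
# Gaussian-copula (nonparanormal) transformation to normal scores
bprs_npn <- huge.npn(bprs)
```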

Table 9 shows the number of edges that result from applying the different methods under consideration. We observe that Ridge-CV, Glasso-CV1, NR-OR-CV and Glasso-BIC yield the densest networks and that applying PCS drastically reduces the number of edges when the original network is not very sparse.

Figure 10 shows the network computed by combining the different PCS networks and discarding all edges that occur only once. Cognitive models of psychosis have postulated that some of the most prominent symptoms are delusional beliefs (grandiosity, suspiciousness, unusual thoughts)46. We indeed observe a strong positive relation between Unusual Thoughts (UnT) and Suspiciousness (Ssp) and between Emotional Withdrawal (EmW) and Blunted Affect (BlA). There are also strong positive relations between Unusual Thoughts (UnT) and Grandiosity (Grn), between Motor Retardation (MtR) and Elevated mood (ElM), between Anxiety (Anx) and Depression (Dpr), between Depression (Dpr) and Guilt (Glt), and between Tension (Tns) and Distractibility (Dst).

Figure 10

Estimated symptoms network of patients with a diagnosis within the nonaffective psychotic spectrum using the BPRS data.

Post-traumatic stress disorder symptoms data

Finally, we return to the PTSD data that we studied in the subsection Simulation study based on real data. Table 9 shows the number of edges for each of the procedures. We observe a similar pattern as in the previous applications. Figure 11 displays the network that results from applying our combination approach to the PCS networks. This combined network recovers the conditional dependencies that Armour et al.3 found to be strongly positive: nightmares (B2) and flashbacks (B3), blame of self or others (D3) and negative trauma related emotions (D4), detachment (D6) and restricted affect (D7), and hypervigilance (E3) and exaggerated startle response (E4).

Figure 11

Estimated symptoms network for the PTSD data.

Discussion

In this article, we have demonstrated through an extensive simulation study that the most popular procedures to estimate partial correlation networks, Glasso, SPACE, NR and Ridge, often do not yield the true underlying network, no matter which procedure is applied to select the regularization parameter. Results are heavily influenced by the sample size and the number of variables (i.e., the lower the sample size and the higher the number of variables, the worse the performance), with high-dimensional problems being especially difficult. We also note that the Glasso results heavily depend on which approach is used to tune the regularization parameter. Specifically, we found that in the high-dimensional setting, using the BIC or EBIC yields many false negatives and thus an overly sparse network.

Given that the state-of-the-art methods frequently cannot satisfactorily recover the true set of edges, we have presented a novel approach that allows better control of the false positive rate. This procedure boils down to performing an additional second step after applying one or more state-of-the-art methods of choice. In this second step, we discard the partial correlation coefficients in the estimated network that are smaller in absolute value than a given threshold, which is obtained through cross-validation. Our novel procedure clearly improved the performance of the estimation methods and tuning approaches considered, especially in the settings where the state-of-the-art methods yielded bad results. Whereas PCS-SPACE-BIC seems to be the best choice for small sample sizes, the method applied in the first step hardly matters when the sample size increases.

We also applied all approaches to three real data sets. The results again show that our PCS approaches yield sparser networks than the state-of-the-art methods. To deal with the multitude of obtained networks, we proposed to compute a network that combines the different PCS estimates, but discards the edges that occurred in only one network. Although the results seemed interpretable, future research should investigate further how to efficiently combine the different estimators or how to optimally select among the nineteen obtained networks.

In this paper we used standard simulation settings from the literature to demonstrate the problematic behaviour of existing approaches. It is important to mention that except for Glasso, none of the state-of-the-art procedures studied in this paper estimates a covariance matrix that is positive definite. Also, it is not guaranteed that this property still holds after applying the PCS to Glasso. In future research, it would be useful to investigate the behavior of the different approaches under more difficult settings as well as the theoretical properties of the PCS. This would lead to several possible extensions of our method. One extension targets data in which the assumption of multivariate normality is violated. Here, our approach can be easily extended to make use of techniques to estimate semiparametric undirected graphs45,47,48. We also note that in some applications, such as in psychology data or in the high dimensional setting, some variables might be highly linearly correlated. In this setting, the assumption regarding the regularity of the covariance matrix might not hold. A possible solution is to first cluster the strongly correlated variables and then take this cluster structure into account when estimating the GGM using the PCS approach49,50.

Finally, it is important to note that imposing sparsity might be too stringent in some applications. For instance, in some cases researchers are also interested in detecting partial correlations that are very close to zero. Moreover, it can also happen that the true network is not so sparse to begin with. In such cases, using approaches based on \({\ell }_{1}\) regularization may affect the validity of the results51. Therefore, we believe that future research should also focus on exploring how the methods proposed in this paper behave when the true underlying network is less sparse or includes some very weak edges.

Methods

Partial correlation estimation procedures

In this subsection we present the technical details of the state-of-the-art methods to estimate sparse partial correlation networks and the associated approaches to tune the regularization parameter.

The graphical lasso

Yuan and Lin25 and Rothman et al.24 proposed a penalized maximum likelihood approach to estimate the inverse of the covariance matrix Σ, denoted by Ω = [ωij]. If S denotes the sample covariance matrix, the problem is to minimize the following penalized negative log-likelihood function:

$$\hat{{\boldsymbol{\Omega }}}({\lambda }_{1})=\mathop{{\rm{a}}{\rm{r}}{\rm{g}}{\rm{m}}{\rm{i}}{\rm{n}}}\limits_{{\boldsymbol{\Omega }}\succ 0}\{{\rm{t}}{\rm{r}}({\bf{S}}{\boldsymbol{\Omega }})-\,\log \,det({\boldsymbol{\Omega }})+{\lambda }_{1}\sum _{i\ne j}\,|{\omega }_{ij}|\},$$
(5)

where tr(⋅) denotes the trace of a matrix and λ1 > 0 controls the size of the penalty. The penalty term is a convex proxy for the number of non-zero elements in the precision matrix. The smaller the value of λ1, the more non-zero elements the model includes. Friedman, Hastie and Tibshirani13 proposed an efficient algorithm to implement this method, which is called the Graphical lasso (Glasso). Afterwards, the partial correlation matrix can be computed using the known relation between the entries of the inverse of the covariance matrix and the partial correlation coefficients (see Lemma 1 in Peng et al.12).
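For a single value of λ1, the estimate in Eq. (5) and the implied partial correlations can be obtained, for instance, with the glasso package; the value λ1 = 0.1 below is purely illustrative and X stands for the n × p data matrix.

```r
library(glasso)
S   <- cov(X)                        # sample covariance matrix
fit <- glasso(S, rho = 0.1)          # lambda1 = 0.1, illustrative only
Omega_hat <- fit$wi                  # estimated precision matrix
Gamma_hat <- -cov2cor(Omega_hat)     # partial correlations (Lemma 1 in Peng et al.)
diag(Gamma_hat) <- 1
```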

For the different applications we select the regularization parameter as follows. We generate a grid of 100 equidistant values for λ1 ranging from 0.001 to |max(S)| when p < 100. When p ≥ 100 the sequence limits are (0.05, |max(S)|). We consider six approaches to select the optimal value from this grid. The first one is to implement K-fold cross-validation using the log-likelihood as performance measure (see Section 4.2 in Huang et al.52 and Section 2.3 in Price et al.53). We denote this procedure Glasso-CV1. We split the sample into K subsets. Using all but the k-th subset, we estimate the precision matrix with Glasso for different values of λ1 and denote this estimate \(\hat{\Omega }\). On the basis of the discarded k-th subset we estimate the sample covariance matrix Sk. Next, for each value of λ1 we compute the following loss function:

$${\rm{C}}{\rm{V}}1({\lambda }_{1})=\mathop{\sum }\limits_{k=1}^{K}\,\{{\rm{t}}{\rm{r}}({{\bf{S}}}^{k}\hat{{\boldsymbol{\Omega }}}({\lambda }_{1}))-\,\log \,det(\hat{{\boldsymbol{\Omega }}}({\lambda }_{1}))\}.$$
(6)

We plot CV1(λ1) versus λ1 and we select the tuning parameter that minimizes the loss function CV1(λ1).
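A simplified R sketch of this CV1 loop is given below; the fold assignment and the grid handling are illustrative choices of ours, not the exact implementation used in the paper.

```r
# Cross-validated negative log-likelihood loss of Eq. (6) for one lambda1 value.
cv1_loss <- function(X, lambda1, K = 10) {
  folds <- sample(rep(1:K, length.out = nrow(X)))
  loss  <- 0
  for (k in 1:K) {
    fit <- glasso::glasso(cov(X[folds != k, , drop = FALSE]), rho = lambda1)
    Sk  <- cov(X[folds == k, , drop = FALSE])
    loss <- loss + sum(diag(Sk %*% fit$wi)) -
            as.numeric(determinant(fit$wi, logarithm = TRUE)$modulus)
  }
  loss
}
# lambda1_grid <- seq(0.001, max(abs(S)), length.out = 100)
# lambda1_cv1  <- lambda1_grid[which.min(sapply(lambda1_grid, cv1_loss, X = X))]
```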

The second approach uses the one-standard-error-rule20. We denote this procedure Glasso-CV1-1se. Using the loss function in Eq. (6), we first compute the standard deviation of CV11(λ1), …, CV1K(λ1):

$${\rm{sd}}({\lambda }_{1})={\rm{sd}}({\rm{CV}}{1}_{1}({\lambda }_{1}),\ldots ,{\rm{CV}}{1}_{K}({\lambda }_{1})).$$
(7)

Next, we compute the standard error of CV1(λ1):

$${\rm{se}}({\lambda }_{1})={\rm{sd}}({\lambda }_{1})/\sqrt{K}.$$
(8)

Finally, given the tuning weight that minimizes the cross-validation error in Eq. (6), denoted by \({\hat{\lambda }}_{1}\), we choose the largest tuning weight that satisfies the following rule:

$${\rm{CV}}1({\lambda }_{1})\le {\rm{CV}}1({\hat{\lambda }}_{1})+{\rm{se}}({\hat{\lambda }}_{1})$$
(9)
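In code, given a K × 100 matrix cv_folds of fold-wise losses (one column per value in an assumed grid lambda1_grid), the rule can be sketched as follows; choosing the largest qualifying λ1 reflects the usual 1se convention assumed here.

```r
cv_mean <- colMeans(cv_folds)                                # CV1(lambda1) per grid value
cv_se   <- apply(cv_folds, 2, sd) / sqrt(nrow(cv_folds))     # Eqs. (7)-(8)
best    <- which.min(cv_mean)                                # lambda1 minimizing CV1
lambda1_1se <- max(lambda1_grid[cv_mean <= cv_mean[best] + cv_se[best]])   # Eq. (9)
```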

The third approach implements K-fold cross-validation using the prediction errors of each node as performance measure. We denote this procedure Glasso-CV2. We split the sample into K subsets. Using all but the k-th subset, we estimate the precision matrix with Glasso for different values of λ1 and denote this estimate \(\hat{\Omega }\). Next, for each value of λ1 we compute the following loss function:

$${\rm{CV}}2({\lambda }_{1})=\mathop{\sum }\limits_{k=1}^{K}\,\mathop{\sum }\limits_{i=1}^{p}\,{\left\Vert {X}_{i}^{k}-\sum _{j\ne i}\left(-\frac{{\hat{\omega }}_{ij}}{{\hat{\omega }}_{ii}}\right){X}_{j}^{k}\right \Vert}^{2}.$$
(10)

We plot CV2(λ1) versus λ1 and we select the tuning parameter that minimizes the loss function CV2(λ1).

The fourth procedure selects the tuning weight by applying the one-standard-error-rule on the cross-validation procedure CV2. We denote this procedure Glasso-CV2-1se.

The fifth and sixth procedures to select the optimal regularization parameter from the 100 considered λ1 values are based on the Bayesian Information Criterion (BIC) or the Extended Bayesian Information Criterion (EBIC). We refer to these procedures as Glasso-BIC and Glasso-EBIC, respectively. We select the value of λ1 that minimizes the following loss function:

$${\rm{EBIC}}({\lambda }_{1})=-\,2 {\mathcal L} (\hat{{\boldsymbol{\Omega }}}({\lambda }_{1}))+\kappa \,\log (n)+4\kappa \gamma \,\log (p)$$
(11)

where \( {\mathcal L} (\cdot )\) is the value of the log-likelihood function that corresponds to the estimated matrix \(\hat{\Omega }\), κ is the number of edges in the estimated network and γ ∈ [0, 1] is a parameter that controls the strength of the additional penalty on the number of edges. If γ = 0, Eq. (11) reduces to the classical BIC. Positive values of γ lead to stronger penalization. To compute the EBIC, we follow the recommendation of Chen and Chen54 and Foygel and Drton21 and set γ to 0.5 (see also refs. 55,56).
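For completeness, a hedged R sketch of Eq. (11) for a Glasso fit, with the Gaussian log-likelihood written up to an additive constant, is given below (helper name ours).

```r
# EBIC for an estimated precision matrix; gamma = 0 gives the classical BIC.
ebic_glasso <- function(Omega_hat, S, n, gamma = 0.5) {
  loglik <- (n / 2) * (as.numeric(determinant(Omega_hat, logarithm = TRUE)$modulus) -
                       sum(diag(S %*% Omega_hat)))
  kappa  <- sum(Omega_hat[upper.tri(Omega_hat)] != 0)        # number of edges
  -2 * loglik + kappa * log(n) + 4 * kappa * gamma * log(ncol(S))
}
```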

Nodewise regression

Meinshausen and Bühlmann11 proposed to estimate the set of network edges by performing p separate lasso regressions:

$${\hat{{\boldsymbol{\beta }}}}_{i}({\lambda }_{2})=\mathop{{\rm{a}}{\rm{r}}{\rm{g}}{\rm{m}}{\rm{i}}{\rm{n}}}\limits_{{\beta }_{ij}}\left\{\frac{1}{2}{\left\Vert {X}_{i}-\sum _{j\ne i}{\beta }_{ij}{X}_{j}\right\Vert }^{2}+{\lambda }_{2}\sum _{j\ne i}\,|{\beta }_{ij}|\right\},$$
(12)

where \({\hat{\beta }}_{i}\) is a vector that contains the p − 1 estimated regression weights of node i and λ2 > 0 is the regularization parameter that controls the number of non-zero elements in the neighborhood of node i. The set of edges can be computed with the AND-rule:

estimate an edge between nodes i and j ⇔ \({\hat{\beta }}_{ij}\) ≠ 0 and \({\hat{\beta }}_{ji}\) ≠ 0

yielding the NR-AND procedure.

Alternatively, we can use the NR-OR method and compute the edge set with the OR-rule:

estimate an edge between nodes i and j ⇔ \({\hat{\beta }}_{ij}\) ≠ 0 or \({\hat{\beta }}_{ji}\) ≠ 0.

Next, the partial correlation matrix can be computed using the relation between the prediction errors of the best linear predictor of each node and the partial correlation coefficients (see Lemma 1 in Peng et al.12).
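A compact sketch of nodewise regression with glmnet and the AND/OR rules, for a fixed node-specific λ2 (the tuning procedures are discussed next; the bookkeeping of column ordering is ours):

```r
library(glmnet)

nodewise_edges <- function(X, lambda2, rule = c("AND", "OR")) {
  rule <- match.arg(rule)
  p <- ncol(X)
  B <- matrix(0, p, p)                       # B[i, j]: weight of X_j in the regression of X_i
  for (i in 1:p) {
    fit <- glmnet(X[, -i], X[, i], lambda = lambda2)
    B[i, -i] <- as.numeric(coef(fit))[-1]    # drop the intercept
  }
  A <- if (rule == "AND") (B != 0) & (t(B) != 0) else (B != 0) | (t(B) != 0)
  diag(A) <- FALSE
  A * 1                                      # binary adjacency matrix
}
```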

To select the tuning parameter λ2 for each regression separately, we generate a grid of 100 possible values using the sequence generated with the function glmnet of the R package glmnet57. We consider four different tuning procedures. The first, NR-CV, performs K-fold cross-validation. Discarding the k-th subset, we estimate the vector of regression weights \({\hat{\beta }}_{i}\) using a lasso regression. We select the value of λ2 that minimizes the following loss function:

$${\rm{CV}}({\lambda }_{2})=\mathop{\sum }\limits_{k=1}^{K}\,{\left\Vert {X}_{i}^{k}-\sum _{j\ne i}{\hat{\beta }}_{ij}{X}_{j}^{k}\right\Vert }^{2},$$
(13)

where Xik are the observations in the discarded subset k.

The second approach adapts this cross-validation approach by using the one-standard-error-rule. We denote this procedure NR-CV-1se.

The third procedure to select the regularization parameter, NR-BIC, involves computing the Bayesian Information Criterion (BIC) for different values of λ2. For each node, we select the value of λ2 that minimizes the following loss function:

$${{\rm{BIC}}}_{i}({\lambda }_{2})=n{\rm{RSS}}({\hat{{\boldsymbol{\beta }}}}_{i})+{\kappa }_{i}\,\log (n)$$
(14)

where RSS(⋅) is the value of the residual sum of squares for the i-th regression and κi is the number of elements in the estimated neighborhood of node i.

The fourth procedure is NR-FSR and uses a finite sample result. Meinshausen and Bühlmann11 show that, under certain assumptions regarding the sparsity and regularity conditions of the covariance matrix and the regression weights, the estimated neighborhood of a node i contains false positive edges with probability at most α ∈ (0, 1) if the \({\ell }_{1}\) penalty parameter is set as \({\lambda }_{2}(\alpha )=\frac{2}{\sqrt{n}}{\Phi }^{-1}(1-\frac{\alpha }{2{p}^{2}})\), where Φ−1 is the inverse of the c.d.f. of N(0, 1). We set this bound α to 0.05.
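The corresponding penalty level can be computed directly; the helper name below is ours and standardized data are assumed, as in the formula above.

```r
# Finite-sample penalty weight for nodewise regression.
lambda2_fsr <- function(n, p, alpha = 0.05) {
  (2 / sqrt(n)) * qnorm(1 - alpha / (2 * p^2))
}
```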

Joint sparse linear regression

Peng et al.12 proposed to estimate the partial correlation matrix by minimizing the following joint sparse regression (SPACE):

$$\hat{\Gamma }({\lambda }_{3})=\mathop{{\rm{a}}{\rm{r}}{\rm{g}}{\rm{m}}{\rm{i}}{\rm{n}}}\limits_{{\rho }_{ij},{\omega }_{ii}}\left \{\frac{1}{2}\left(\mathop{\sum }\limits_{i=1}^{p}\,{\left\Vert {X}_{i}-\sum _{j\ne i}{\rho }_{ij| V{\rm{\setminus }}\{i,j\}}\sqrt{\frac{{\omega }_{jj}}{{\omega }_{ii}}}{X}_{j}\right\Vert }^{2}\right)+{\lambda }_{3}\sum _{1\le i < j\le p}\,|{\rho }_{ij}|\right\},$$
(15)

where ωii is the i-th diagonal element of the matrix Ω, whose reciprocal 1/ωii equals the residual variance of the optimal prediction of Xi given all remaining variables, and λ3 > 0 is the regularization parameter that controls the number of non-zero elements in the partial correlation matrix Γ.

Given a grid of 100 equidistant values for λ3 ranging from \(\sqrt{n}{\Phi }^{-1}(1-\frac{0.9}{2{p}^{2}})\) to \(\sqrt{n}{\Phi }^{-1}(1-\frac{{10}^{-4}}{2{p}^{2}})\), we consider four different procedures to calibrate the tuning parameter λ3. The first, SPACE-CV, performs K-fold cross-validation: we split the sample into K subsets and select the parameter value that minimizes the following loss function:

$${\rm{C}}{\rm{V}}({\lambda }_{3})=\mathop{\sum }\limits_{k=1}^{K}\,\mathop{\sum }\limits_{i=1}^{p}\,{\left\Vert {X}_{i}^{k}-\sum _{j\ne i}{\hat{\rho }}_{ij| V{\rm{\setminus }}\{i,j\}}\sqrt{\frac{{\hat{\omega }}_{jj}}{{\hat{\omega }}_{ii}}}{X}_{j}^{k}\right\Vert }^{2}.$$
(16)

The second procedure again adapts this cross-validation approach by using the one-standard-error-rule. We denote this procedure SPACE-CV-1se.

The third procedure, SPACE-BIC, involves computing the Bayesian Information Criterion (BIC) for the 100 values of λ3. First, we compute for each node the residual sum of squares:

$${{\rm{R}}{\rm{S}}{\rm{S}}}_{i}({\hat{\rho }}_{ij| V{\rm{\setminus }}\{i,j\}},{\hat{\omega }}_{ii})={\left\Vert {X}_{i}-\sum _{j\ne i}{\hat{\rho }}_{ij| V{\rm{\setminus }}\{i,j\}}\sqrt{\frac{{\hat{\omega }}_{jj}}{{\hat{\omega }}_{ii}}}{X}_{j}\right\Vert }^{2},$$

Next, we select the value of λ3 by minimizing:

$${\rm{B}}{\rm{I}}{\rm{C}}({\lambda }_{3})=\mathop{\sum }\limits_{i=1}^{p}\,\left(n{{\rm{R}}{\rm{S}}{\rm{S}}}_{i}({\hat{\rho }}_{ij| V{\rm{\setminus }}\{i,j\}},{\hat{\omega }}_{ii})+{\kappa }_{i}\,\log (n)\right)$$
(17)

where κi is the number of elements in the estimated neighborhood of node i.

The fourth procedure, SPACE-FSR, is based on the finite sample result by Peng et al.12. These authors show that, under certain assumptions regarding the sparsity and regularity conditions of the covariance matrix and the regression weights, the estimated neighborhood of a node i contains false positive edges with probability at most α ∈ (0, 1) if the penalty parameter is set as \({\lambda }_{3}(\alpha )=\sqrt{n}{\Phi }^{-1}(1-\frac{\alpha }{2{p}^{2}})\), where Φ−1 is the inverse of the c.d.f. of N(0, 1). We again set this bound α to 0.05.

Partial correlation estimation using ridge regression

Ha and Sun19 proposed to estimate a penalized partial correlation matrix using a ridge penalty. We apply a simpler version of their method by performing p separate ridge regressions:

$${\hat{{\boldsymbol{\delta }}}}_{i}({\lambda }_{4})=\mathop{{\rm{a}}{\rm{r}}{\rm{g}}{\rm{m}}{\rm{i}}{\rm{n}}}\limits_{{\delta }_{ij}}\left\{\frac{1}{2}{\left\Vert {X}_{i}-\sum _{j\ne i}{\delta }_{ij}{X}_{j}\right\Vert }^{2}+{\lambda }_{4}\sum _{j\ne i}\,{\delta }_{ij}^{2}\right\},$$
(18)

where \({\hat{\delta }}_{i}\) is a vector that contains the p − 1 estimated regression weights for node i and λ4 > 0 is the regularization parameter that controls the amount of shrinkage of the regression weights toward zero in the neighborhood of node i. The partial correlation matrix is computed using the relation between the prediction errors of the best linear predictor of each node and the partial correlation coefficients (see Lemma 1 in Peng et al.12).

To select the tuning parameter λ4 for each regression separately we generate a grid of 100 possible values using the sequence generated with the function glmnet of the R package glmnet57. We select the regularization parameter by performing K-fold cross-validation. Discarding the k-th subset we estimate the vector of regression weights \({\hat{\delta }}_{i}\) using ridge regression. We select the value of λ4 that minimizes the following loss function:

$${\rm{CV}}({\lambda }_{4})=\mathop{\sum }\limits_{k=1}^{K}\,{\left\Vert {X}_{i}^{k}-\sum _{j\ne i}{\hat{\delta }}_{ij}{X}_{j}^{k}\right\Vert }^{2},$$
(19)

where Xik are the observations in the discarded subset k. We denote this procedure Ridge-CV.
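In practice this can be done with cv.glmnet using alpha = 0 (ridge); the sketch below handles a single node i and is a simplified stand-in for the procedure described above, with X denoting the n × p data matrix.

```r
library(glmnet)
i <- 1                                                           # node index, for illustration
cvfit   <- cv.glmnet(X[, -i], X[, i], alpha = 0, nfolds = 10)    # 10-fold CV over lambda4
delta_i <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]         # ridge weights of node i
```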

Partial correlation screening procedure

In this subsection we present the technical details of the Partial Correlation Screening (PCS) algorithm. The procedure estimates the set of edges in two steps. In the first step, we determine a sparse partial correlation network, denoted by \(\hat{\Gamma }=[{\hat{\rho }}_{ij|V\backslash \{i,j\}}]\), using one of the methods that we discussed in the previous subsection.

In the second step of the algorithm, we detect unimportant pairs of variables by thresholding the partial correlations estimated in the first step. For i ∈ V and a threshold parameter τ ∈ (0, 1), we estimate the neighborhood of node i as follows

$${\hat{\mathscr A}}_{i,\tau }=\{j\in V\backslash \{i\}:|{\hat{\rho }}_{ij|V\backslash \{i,j\}}| > \tau \}.$$
(20)

The algorithm outputs the estimated set of edges for a given threshold τ:

$${\hat{E}}_{\tau }=\{(i,j)\in V\times V:|{\hat{\rho }}_{ij|V\backslash \{i,j\}}| > \tau \}.$$
(21)

Finally, the prediction error of the regression of each node i conditioned on the variables that belong to the estimated neighborhood set \({\hat{\mathscr A}}_{i,\tau }\) is given by

$${\hat{\varepsilon }}_{i,\tau }={X}_{i}-\sum _{j\in {\hat{\mathscr A}}_{i,\tau }}\,{\hat{\theta }}_{ij,\tau }{X}_{j},$$

where \({\hat{\theta }}_{i,\tau }\) is the vector of estimated regression coefficients of node i ∈ V given the variables in the estimated neighborhood set \({\hat{\mathscr A}}_{i,\tau }\).
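The screening step itself is a one-line operation on the estimated partial correlation matrix; a minimal sketch (helper name ours):

```r
# Set all partial correlations with |rho| <= tau to zero (Eqs. 20-21).
pcs_threshold <- function(Gamma_hat, tau) {
  Gamma_hat[abs(Gamma_hat) <= tau] <- 0
  diag(Gamma_hat) <- 1
  Gamma_hat
}
```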

Choice of the tuning parameter

To select the threshold parameter τ, we perform K-fold cross-validation. We generate a sequence of 100 equidistant values for the threshold τ ranging from 0.0001 to 1. The procedure to select the threshold uses a double loop. First, for each of the estimation procedures proposed in the previous subsection, we select the regularization parameter λ. Second, we split the sample into K subsets. Using all but the k-th subset, we estimate a sparse partial correlation network using the selected regularization parameter λ. Next, for each value of τ in the grid, we estimate the neighborhood of each node (see Eq. (20)) and the regression weights vector \({\hat{\theta }}_{i,\tau }\). For each value of τ we compute the following loss function:

$$CV(\tau )=\mathop{\sum }\limits_{k=1}^{K}\,\mathop{\sum }\limits_{i=1}^{p}\,{\left\Vert {X}_{i}^{k}-\sum _{j\in {\hat{\mathscr A}}_{i,\tau }}{\hat{\theta }}_{ij,\tau }{X}_{j}^{k}\right\Vert }^{2}.$$
(22)

We plot CV(τ) versus τ and we select the threshold parameter that minimizes the loss function CV(τ).
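A simplified R sketch of this double loop is given below; estimate_gamma is a placeholder for any first-step estimator (Glasso, SPACE, NR or Ridge) whose regularization parameter has already been selected, and refitting the weights by ordinary least squares on the selected neighborhood is our reading of how \({\hat{\theta }}_{i,\tau }\) is obtained.

```r
pcs_cv_tau <- function(X, estimate_gamma, K = 10,
                       tau_grid = seq(0.0001, 1, length.out = 100)) {
  n <- nrow(X); p <- ncol(X)
  folds  <- sample(rep(1:K, length.out = n))
  cv_err <- numeric(length(tau_grid))
  for (k in 1:K) {
    Xtr <- X[folds != k, , drop = FALSE]
    Xte <- X[folds == k, , drop = FALSE]
    Gamma_hat <- estimate_gamma(Xtr)                             # first-step sparse estimate
    for (g in seq_along(tau_grid)) {
      for (i in 1:p) {
        nb <- setdiff(1:p, i)[abs(Gamma_hat[i, -i]) > tau_grid[g]]   # neighborhood of node i
        if (length(nb) == 0) {
          pred <- 0                                              # empty neighborhood: predict zero
        } else {
          theta <- coef(lm(Xtr[, i] ~ Xtr[, nb, drop = FALSE] - 1))  # refitted weights
          pred  <- Xte[, nb, drop = FALSE] %*% theta
        }
        cv_err[g] <- cv_err[g] + sum((Xte[, i] - pred)^2)        # Eq. (22)
      }
    }
  }
  tau_grid[which.min(cv_err)]
}
```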