Estimating psychological networks and their accuracy: A tutorial paper
Abstract
The use of psychological networks that conceptualize behavior as a complex interplay of psychological and other components has gained increasing popularity in various research fields. While prior publications have tackled the topics of estimating and interpreting such networks, little work has been conducted to check how accurately (i.e., how prone to sampling variation) networks are estimated, and how stable (i.e., whether the interpretation remains similar with fewer observations) inferences from the network structure (such as centrality indices) are. In this tutorial paper, we aim to introduce the reader to this field and tackle the problem of accuracy under sampling variation. We first introduce the current state-of-the-art of network estimation. Second, we provide a rationale for why researchers should investigate the accuracy of psychological networks. Third, we describe how bootstrap routines can be used to (A) assess the accuracy of estimated network connections, (B) investigate the stability of centrality indices, and (C) test whether network connections and centrality estimates for different variables differ from each other. We introduce two novel statistical methods: for (B) the correlation stability coefficient, and for (C) the bootstrapped difference test for edge-weights and centrality indices. We conducted and present simulation studies to assess the performance of both methods. Finally, we developed the free R-package bootnet that allows for estimating psychological networks in a generalized framework in addition to the proposed bootstrap methods. We showcase bootnet in a tutorial, accompanied by R syntax, in which we analyze a dataset of 359 women with posttraumatic stress disorder available online.
Keywords
Network psychometrics, Psychological networks, Replicability, Bootstrap, Tutorial
Introduction
In the last five years, network research has gained substantial attention in psychological sciences (Borsboom and Cramer 2013; Cramer et al. 2010). In this field of research, psychological behavior is conceptualized as a complex interplay of psychological and other components. To portray a potential structure in which these components interact, researchers have made use of psychological networks. Psychological networks consist of nodes representing observed variables, connected by edges representing statistical relationships. This methodology has gained substantial footing and has been used in various different fields of psychology, such as clinical psychology (e.g., Boschloo et al., 2015; Fried et al., 2015; McNally et al., 2015; Forbush et al., 2016), psychiatry (e.g., Isvoranu et al., 2016, 2017; van Borkulo et al., 2015), personality research (e.g., Costantini et al., 2015a; Cramer et al., 2012), social psychology (e.g., Dalege et al., 2016), and quality of life research (Kossakowski et al. 2015).
These analyses typically involve two steps: (1) estimate a statistical model on data, from which some parameters can be represented as a weighted network between observed variables, and (2), analyze the weighted network structure using measures taken from graph theory (Newman, 2010) to infer, for instance, the most central nodes.^{1} Step 1 makes psychological networks strikingly different from network structures typically used in graph theory, such as power grids (Watts and Strogatz 1998), social networks (Wasserman and Faust 1994) or ecological networks (Barzel and Biham 2009), in which nodes represent entities (e.g., airports, people, organisms) and connections are generally observed and known (e.g., electricity lines, friendships, mutualistic relationships). In psychological networks, the strength of connection between two nodes is a parameter estimated from data. With increasing sample size, the parameters will be more accurately estimated (close to the true value). However, in the limited sample size psychological research typically has to offer, the parameters may not be estimated accurately, and in such cases, interpretation of the network and any measures derived from the network is questionable. Therefore, in estimating psychological networks, we suggest a third step is crucial: (3) assessing the accuracy of the network parameters and measures.
Only a few analyses so far have taken accuracy into account (e.g., Fried et al., 2016), mainly because the methodology has not yet been worked out. This problem of accuracy is omnipresent in statistics. Imagine researchers employ a regression analysis to examine three predictors of depression severity, and identify one strong, one weak, and one unrelated regressor. If removing one of these three regressors, or adding a fourth one, substantially changes the regression coefficients of the other regressors, results are unstable and depend on specific decisions the researchers make, implying a problem of accuracy. The same holds for psychological networks. Imagine in a network of psychopathological symptoms that we find that symptom A has a much higher node strength than symptom B, leading to the clinical interpretation that A may be a more relevant target for treatment than the peripheral symptom B (Fried et al. 2016). Clearly, this interpretation relies on the assumption that the centrality estimates are indeed different from each other. Given the current uncertainty, there is a danger of obtaining network structures that are sensitive to the specific variables included or to the specific estimation method used. This poses a major challenge, especially when substantive interpretations such as treatment recommendations in the psychopathological literature, or the generalizability of the findings, are important. The current replication crisis in psychology (Open Science Collaboration 2015) stresses the crucial importance of obtaining robust results, and we want the emerging field of psychopathological networks to start off on the right foot.
The remainder of the article is structured into three sections. In the first section, we give a brief overview of often used methods in estimating psychological networks, including an overview of open-source software packages that implement these methods available in the statistical programming environment R (R Core Team 2016). In the second section, we outline a methodology to assess the accuracy of psychological network structures that includes three steps: (A) estimate confidence intervals (CIs) on the edge-weights, (B) assess the stability of centrality indices under observing subsets of cases, and (C) test for significant differences between edge-weights and centrality indices. We introduce the freely available R package, bootnet,^{3} that can be used both as a generalized framework to estimate various different network models as well as to conduct the accuracy tests we propose. We demonstrate the package’s functionality of both estimating networks and checking their accuracy in a step-by-step tutorial using a dataset of 359 women with posttraumatic stress disorder (PTSD; Hien et al., 2009) that can be downloaded from the Data Share Website of the National Institute on Drug Abuse. Finally, in the last section, we show the performance of the proposed methods for investigating accuracy in three simulation studies. It is important to note that the focus of our tutorial is on cross-sectional network models that can readily be applied to many current psychological datasets. Many sources have already outlined the interpretation of probabilistic network models (e.g., Epskamp et al., 2016; Koller and Friedman, 2009; Lauritzen, 1996), as well as network inference techniques, such as centrality measures, that can be used once a network is obtained (e.g., Costantini et al., 2015a; Kolaczyk, 2009; Newman, 2004; Sporns et al., 2004).
To make this tutorial stand-alone readable for psychological researchers, we included a detailed description of how to interpret psychological network models as well as an overview of network measures in the Supplementary Materials. We hope that this tutorial will enable researchers to gauge the accuracy and certainty of the results obtained from network models, and to provide editors, reviewers, and readers of psychological network papers the possibility to better judge whether substantive conclusions drawn from such analyses are defensible.
Estimating psychological networks
As described in more detail in the Supplementary Materials, a popular network model to use in estimating psychological networks is a pairwise Markov random field (PMRF; Costantini et al. 2015a, van Borkulo et al. 2014), on which the present paper is focused. It should be noted, however, that the described methodology could be applied to other network models as well. A PMRF is a network in which nodes represent variables, connected by undirected edges (edges with no arrowhead) indicating conditional dependence between two variables; two variables that are not connected are independent after conditioning on other variables. When data are multivariate normal, such a conditional independence would correspond to a partial correlation being equal to zero. Conditional independencies are also to be expected in many causal structures (Pearl 2000). In cross-sectional observational data, causal networks (e.g., directed networks) are hard to estimate without stringent assumptions (e.g., no feedback loops). In addition, directed networks suffer from a problem of many equivalent models (e.g., a network A→B is not statistically distinguishable from a network A←B; MacCallum et al., 1993). PMRFs, however, are well defined and have no equivalent models (i.e., for a given PMRF, there exists no other PMRF that describes exactly the same statistical independence relationships for the set of variables under consideration). Therefore, they facilitate a clear and unambiguous interpretation of the edge-weight parameters as strength of unique associations between variables, which in turn may highlight potential causal relationships.
When the data are binary, the appropriate PMRF model to use is called the Ising model (van Borkulo et al. 2014). When the data follow a multivariate normal density, the appropriate PMRF model is called the Gaussian graphical model (GGM; Costantini et al., 2015a, Lauritzen, 1996), in which edges can directly be interpreted as partial correlation coefficients. The GGM requires an estimate of the covariance matrix as input,^{4} for which polychoric correlations can also be used in case the data are ordinal (Epskamp 2016). For continuous data that are not normally distributed, a transformation can be applied (e.g., by using the nonparanormal transformation; Liu et al., 2012) before estimating the GGM. Finally, mixed graphical models can be used to estimate a PMRF containing both continuous and categorical variables (Haslbeck and Waldorp 2016b).
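The link between a covariance (or correlation) matrix and the partial correlations that form the edges of a GGM can be illustrated with a few lines of base R. The sketch below is only an unregularized illustration on simulated data, not the estimator used in this paper; the helper name partial_cor and the simulated variables x1 to x3 are hypothetical. It relies on the standard result that partial correlations are obtained by standardizing the inverse of the covariance matrix and reversing the sign of its off-diagonal elements.

```r
# Minimal sketch (assumption: unregularized GGM on simulated data).
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)
x3 <- 0.6 * x2 + rnorm(n)   # x3 depends on x1 only through x2
Data <- cbind(x1, x2, x3)

# Hypothetical helper: partial correlations from a covariance or correlation matrix
partial_cor <- function(S) {
  K <- solve(S)             # precision (inverse covariance) matrix
  P <- -cov2cor(K)          # standardize and flip the sign of off-diagonal elements
  diag(P) <- 0              # a network has no self-loops
  P
}

pcors <- partial_cor(cor(Data))
round(pcors, 2)
```

In this simulated chain, x1 and x3 are conditionally independent given x2, so the estimated x1 to x3 edge should be close to zero while the other two edges should be clearly positive.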
Dealing with the problem of small N in psychological data
Estimating a PMRF features a severe limitation: the number of parameters to estimate grows quickly with the size of the network. In a ten-node network, 55 parameters (ten threshold parameters and 10×9/2=45 pairwise association parameters) already need to be estimated. This number grows to 210 in a network with 20 nodes, and to 1275 in a 50-node network. To reliably estimate that many parameters, the number of observations needed typically exceeds the number available in characteristic psychological data. To deal with the problem of relatively small datasets, recent researchers using psychological networks have applied the ‘least absolute shrinkage and selection operator’ (LASSO; Tibshirani, 1996). This technique is a form of regularization. The LASSO employs such a regularizing penalty by limiting the total sum of absolute parameter values (thus treating positive and negative edge-weights equally), leading many edge estimates to shrink to exactly zero and drop out of the model. As such, the LASSO returns a sparse (or, in substantive terms, conservative) network model: only a relatively small number of edges are used to explain the covariation structure in the data. Because of this sparsity, the estimated models become more interpretable. The LASSO utilizes a tuning parameter to control the degree to which regularization is applied. This tuning parameter can be selected by minimizing the Extended Bayesian Information Criterion (EBIC; Chen and Chen, 2008). Model selection using the EBIC has been shown to work well in both estimating the Ising model (Foygel Barber and Drton 2015; van Borkulo et al. 2014) and the GGM (Foygel and Drton 2010). The remainder of this paper focuses on the GGM estimation method proposed by Foygel and Drton (2010; see also Epskamp and Fried, 2016, for a detailed introduction of this method for psychological researchers).
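The parameter counts quoted above follow from adding one threshold parameter per node to one pairwise parameter per pair of nodes. A small base R check, with the hypothetical helper name n_parameters, reproduces the three numbers:

```r
# Hypothetical helper: number of PMRF parameters for a network of p nodes
n_parameters <- function(p) {
  thresholds <- p                  # one threshold (intercept) per node
  pairwise   <- p * (p - 1) / 2    # one association parameter per pair of nodes
  thresholds + pairwise
}

sapply(c(10, 20, 50), n_parameters)
# 55 210 1275
```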
Estimating regularized networks in R is straightforward. For the Ising model, LASSO estimation using EBIC has been implemented in the IsingFit package (van Borkulo et al. 2014). For GGM networks, a well-established and fast algorithm for LASSO-regularized estimation is the graphical LASSO (glasso; Friedman et al., 2008), which is implemented in the package glasso (Friedman et al. 2014). The qgraph package (Epskamp et al. 2012) utilizes glasso in combination with EBIC model selection to estimate a regularized GGM. Alternatively, the huge (Zhao et al. 2015) and parcor (Krämer et al. 2009) packages implement several regularization methods, including glasso with EBIC model selection, to estimate a GGM. Finally, mixed graphical models have been implemented in the mgm package (Haslbeck and Waldorp 2016a).
Network accuracy
The above description is an overview of the current state of network estimation in psychology. While network inference is typically performed by assessing edge strengths and node centrality, little work has been done in investigating how accurate these inferences are. This section will outline methods that can be used to gain insights into the accuracy of edge-weights and the stability of centrality indices in the estimated network structure. We outline several methods that should routinely be applied after a network has been estimated. These methods will follow three steps: (A) estimation of the accuracy of edge-weights, by drawing bootstrapped CIs; (B) investigating the stability of (the order of) centrality indices after observing only portions of the data; and (C) performing bootstrapped difference tests between edge-weights and centrality indices to test whether these differ significantly from each other. We introduce these methods in decreasing order of importance: while (A) should always be performed, a researcher not interested in centrality indices might not perform other steps, whereas a researcher not interested in testing for differences might only perform (A) and (B). Simulation studies have been conducted to assess the performance of these methods, which are reported in a later section in the paper.
Edge-weight accuracy
To assess the variability of edge-weights, we can estimate a CI: in 95 % of the cases such a CI will contain the true value of the parameter. To construct a CI, we need to know the sampling distribution of the statistic of interest. While such sampling distributions can be difficult to obtain for complicated statistics such as centrality measures, there is a straightforward way of constructing CIs for many statistics: bootstrapping (Efron 1979). Bootstrapping involves repeatedly estimating a model under sampled or simulated data and estimating the statistic of interest. Following the bootstrap, a 1−α CI can be approximated by taking the interval between quantiles α/2 and 1−α/2 of the bootstrapped values. We term such an interval a bootstrapped CI. Bootstrapping edge-weights can be done in two ways: using nonparametric bootstrap and parametric bootstrap (Bollen and Stine 1992). In nonparametric bootstrapping, observations in the data are resampled with replacement to create new plausible datasets, whereas parametric bootstrapping samples new observations from the parametric model that has been estimated from the original data; this creates a series of values that can be used to estimate the sampling distribution. Bootstrapping can be applied as well to LASSO regularized statistics (Hastie et al. 2015).
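The nonparametric bootstrap described above can be sketched in base R as follows. This is an illustration under simplifying assumptions: the data are simulated, the edge-weights are unregularized partial correlations rather than LASSO estimates, and the names edge_weights, n_boots, and boots are hypothetical. Rows are resampled with replacement, the edge-weights are re-estimated in every bootstrap sample, and the bootstrapped CI is the interval between the α/2 and 1−α/2 quantiles.

```r
# Minimal sketch (assumptions: simulated data, unregularized partial correlations).
set.seed(123)
n  <- 300
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
x3 <- 0.5 * x2 + rnorm(n)
Data <- cbind(x1, x2, x3)

# Hypothetical helper: edge-weights as partial correlations
edge_weights <- function(d) {
  P <- -cov2cor(solve(cor(d)))
  P[upper.tri(P)]                 # order of edges: 1-2, 1-3, 2-3
}

n_boots <- 1000
alpha   <- 0.05

# Resample cases (rows) with replacement and re-estimate the edge-weights
boots <- replicate(n_boots, {
  rows <- sample(nrow(Data), replace = TRUE)
  edge_weights(Data[rows, , drop = FALSE])
})

# Bootstrapped CI per edge: quantiles alpha/2 and 1 - alpha/2 (quantile type 6)
ci <- apply(boots, 1, quantile, probs = c(alpha / 2, 1 - alpha / 2), type = 6)
colnames(ci) <- c("x1-x2", "x1-x3", "x2-x3")
round(ci, 2)
```

Each column of ci holds the lower and upper bound for one edge; a wide interval indicates that the corresponding edge-weight is estimated with little accuracy.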
With N_{B} bootstrap samples, at maximum a CI with α = 2/N_{B} can be formed. In this case, the CI equals the range of bootstrapped samples and is based on the two most extreme samples (minimum and maximum). As such, for a certain level of α at the very least 2/α bootstrap samples are needed. It is recommended, however, to use more bootstrap samples to improve consistency of results. The estimation of quantiles is not trivial and can be done using various methods (Hyndman and Fan 1996). In unreported simulation studies available on request, we found that the default quantile estimation method used in R (type 7; Gumbel, 1939) constructed CIs that were too small when samples are normally or uniformly distributed, inflating α. We have thus changed the method to type 6, described in detail by Hyndman and Fan (1996), which resulted in CIs of proper width in uniformly distributed samples, and slightly wider CIs when samples were distributed normally. Simulation studies below that use type 6 show that this method allows for testing of significant differences at the correct α level.
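The effect of the quantile estimation method can be inspected directly with the type argument of the base R function quantile. The sketch below uses normally distributed stand-in values in place of real bootstrapped edge-weights (an assumption made only for illustration) and compares the default type 7 with type 6; it also computes the minimum number of bootstrap samples 2/α for a given α.

```r
# Minimal sketch (assumption: simulated stand-in values for a bootstrap distribution).
set.seed(42)
alpha     <- 0.05
min_boots <- 2 / alpha            # smallest number of bootstrap samples for this alpha

boot_values <- rnorm(1000)        # stand-in for 1,000 bootstrapped values
probs <- c(alpha / 2, 1 - alpha / 2)

ci_type7 <- quantile(boot_values, probs, type = 7)   # R default
ci_type6 <- quantile(boot_values, probs, type = 6)   # method used in this paper

width7 <- unname(diff(ci_type7))
width6 <- unname(diff(ci_type6))
c(type7 = width7, type6 = width6)
```

For quantiles below 0.5, type 6 interpolates at a lower position in the ordered values than type 7, and for quantiles above 0.5 at a higher position, so the type 6 interval is never narrower than the type 7 interval.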
Nonparametric bootstrapping can always be applied, whereas parametric bootstrapping requires a parametric model of the data. When we estimate a GGM, data can be sampled by sampling from the multivariate normal distribution through the use of the R package mvtnorm (Genz et al. 2008); to sample from the Ising model, we have developed the R package IsingSampler (Epskamp 2014). Using the GGM model, the parametric bootstrap samples continuous multivariate normal data, which is an important distinction from ordinal data if the GGM was estimated using polychoric correlations. Therefore, we advise the researcher to use the nonparametric bootstrap when handling ordinal data. Furthermore, when LASSO regularization is used to estimate a network, the edge-weights are on average made smaller due to shrinkage, which biases the parametric bootstrap. The nonparametric bootstrap is in addition fully data-driven and requires no theory, whereas the parametric bootstrap is more theory driven. As such, we will only discuss the nonparametric bootstrap in this paper and advise the researcher to use the parametric bootstrap only when no regularization is used and either the nonparametric results prove unstable or the correspondence of bootstrapped CIs between both methods is to be checked.
It is important to stress that the bootstrapped results should not be used to test for significance of an edge being different from zero. While unreported simulation studies showed that observing if zero is in the bootstrapped CI does function as a valid null-hypothesis test (the null-hypothesis is rejected at a rate below α when it is true), the utility of testing for significance in LASSO regularized edges is questionable. In the case of partial correlation coefficients, without using LASSO the sampling distribution is well known and p-values are readily available. LASSO regularization aims to estimate edges that are not needed to be exactly zero. Therefore, observing that an edge is not set to zero already indicates that the edge is sufficiently strong to be included in the model. In addition, as later described in this paper, applying a correction for multiple testing is not feasible. In sum, the edge-weight bootstrapped CIs should not be interpreted as significance tests to zero, but only to show the accuracy of edge-weight estimates and to compare edges to one-another.
When the bootstrapped CIs are wide, it becomes hard to interpret the strength of an edge. Interpreting the presence of an edge, however, is not affected by large CIs as the LASSO already performed model selection. In addition, the sign of an edge (positive or negative) can also be interpreted regardless of the width of a CI as the LASSO rarely retains an edge in the model that can either be positive or negative. As centrality indices are a direct function of edge-weights, large edge-weight CIs will likely result in a poor accuracy for centrality indices as well. However, differences in centrality indices can be accurate even when there are large edge-weight CIs, and vice-versa; and there are situations where differences in centrality indices can also be hard to interpret even when the edge-weight CIs are small (for example, when the centralities of nodes do not differ from one-another). The next section will detail steps to investigate centrality indices in more detail.
Centrality stability
While the bootstrapped CIs of edge-weights can be constructed using the bootstrap, we discovered in the process of this research that constructing CIs for centrality indices is far from trivial. As discussed in more detail in the Supplementary Materials, both estimating centrality indices based on a sample and bootstrapping centrality indices result in biased sampling distributions, and thus the bootstrap cannot readily be used to construct true 95 % CIs even without regularization. To allow the researcher insight in the accuracy of the found centralities, we suggest investigating the stability of the order of centrality indices based on subsets of the data. By stability, we mean whether the order of centrality indices remains the same after re-estimating the network with fewer cases or nodes. A case indicates a single observation of all variables (e.g., a person in the dataset) and is represented by rows of the dataset. Nodes, on the other hand, indicate columns of the dataset. Taking subsets of cases in the dataset employs the so-called m out of n bootstrap, which is commonly used to remediate problems with the regular bootstrap (Chernick 2011). Applying this bootstrap for various proportions of cases to drop can be used to assess the correlation between the original centrality indices and those obtained from subsets. If this correlation completely changes after dropping, say, 10 % of the cases, then interpretations of centralities are prone to error. We term this framework the case-dropping subset bootstrap. Similarly, one can opt to investigate the stability of centrality indices after dropping nodes from the network (node-dropping subset bootstrap; Costenbader and Valente, 2003), which has also been implemented in bootnet but is harder to interpret (dropping 50 % of the nodes leads to entirely different network structures). As such, we only investigate stability under case-dropping, while noting that the below described methods can also be applied to node-dropping.
To quantify the stability of centrality indices using subset bootstraps, we propose a measure we term the correlation stability coefficient, or, in short, the CS-coefficient. Let CS(cor = 0.7) represent the maximum proportion of cases that can be dropped, such that with 95 % probability the correlation between original centrality indices and centrality of networks based on subsets is 0.7 or higher. The value of 0.7 can be changed according to the stability a researcher is interested in, but is set to 0.7 by default as this value has classically been interpreted as indicating a very large effect in the behavioral sciences (Cohen 1977). The simulation study below showed that to interpret centrality differences the CS-coefficient should not be below 0.25, and preferably above 0.5. While these cutoff scores emerge as recommendations from this simulation study, they are somewhat arbitrary and should not be taken as definite guidelines.
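The logic of the case-dropping subset bootstrap and the CS-coefficient can be sketched in base R. The code below is a simplified illustration and not the bootnet implementation: it uses simulated data, unregularized partial correlations, node strength as the only centrality index, and a coarse grid of drop proportions. The names strength, cs_coefficient, drop_props, and cor_level are hypothetical.

```r
# Minimal sketch (assumptions: simulated data, unregularized network, node strength only).
set.seed(2018)
n <- 400
p <- 6
f <- rnorm(n)                                  # common factor
loadings <- seq(0.2, 0.9, length.out = p)      # unequal loadings give unequal strengths
Data <- sapply(loadings, function(l) l * f + sqrt(1 - l^2) * rnorm(n))

# Hypothetical helper: node strength = sum of absolute edge-weights per node
strength <- function(d) {
  P <- -cov2cor(solve(cor(d)))
  diag(P) <- 0
  colSums(abs(P))
}

# Hypothetical helper: largest drop proportion for which the correlation with the
# original strengths is at least cor_level in at least prob of the subsets
cs_coefficient <- function(Data, drop_props = seq(0.1, 0.9, by = 0.1),
                           n_boots = 100, cor_level = 0.7, prob = 0.95) {
  original <- strength(Data)
  n <- nrow(Data)
  stable <- sapply(drop_props, function(prop) {
    keep <- round(n * (1 - prop))
    cors <- replicate(n_boots, {
      rows <- sample(n, keep)                  # subset of cases, without replacement
      cor(original, strength(Data[rows, , drop = FALSE]))
    })
    mean(cors >= cor_level) >= prob
  })
  if (!any(stable)) return(0)
  max(drop_props[stable])
}

cs <- cs_coefficient(Data)
cs
```

A value of cs near 0 would indicate that the order of node strength changes as soon as a small share of cases is removed; the grid spacing of 0.1 limits the resolution of the result.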
Testing for significant differences
In addition to investigating the accuracy of edge-weights and the stability of the order of centrality, researchers may wish to know whether a specific edge A–B is significantly larger than another edge A–C, or whether the centrality of node A is significantly larger than that of node B. To that end, the bootstrapped values can be used to test if two edge-weights or centralities significantly differ from one-another. This can be done by taking the difference between bootstrap values of one edge-weight or centrality and another edge-weight or centrality, and constructing a bootstrapped CI around those difference scores. This allows for a null-hypothesis test if the edge-weights or centralities differ from one-another by checking if zero is in the bootstrapped CI (Chernick 2011). We term this test the bootstrapped difference test.
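The bootstrapped difference test for two edge-weights can be sketched in base R. As before, this is an illustration under simplifying assumptions (simulated data, unregularized partial correlations) and not the bootnet implementation; the names edge_weights, differences, and differs are hypothetical. In every bootstrap sample both edges are estimated, their difference is stored, and the test checks whether zero lies inside the bootstrapped CI of the differences.

```r
# Minimal sketch (assumptions: simulated data, unregularized partial correlations).
set.seed(7)
n  <- 300
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n)     # strong x1-x2 edge
x3 <- 0.1 * x1 + rnorm(n)     # weak x1-x3 edge
Data <- cbind(x1, x2, x3)

# Hypothetical helper: the two edge-weights to be compared
edge_weights <- function(d) {
  P <- -cov2cor(solve(cor(d)))
  c(e12 = P[1, 2], e13 = P[1, 3])
}

# Nonparametric bootstrap: both edges are estimated on the same resampled data
boots <- replicate(1000, edge_weights(Data[sample(n, replace = TRUE), ]))

# Difference scores and their bootstrapped 95 % CI (quantile type 6)
differences <- boots["e12", ] - boots["e13", ]
ci_diff <- quantile(differences, probs = c(0.025, 0.975), type = 6)

# The null-hypothesis of equal edge-weights is rejected if zero is outside the CI
differs <- ci_diff[1] > 0 || ci_diff[2] < 0
ci_diff
differs
```

The same procedure applies to centrality indices by replacing the two edge-weights with the centralities of two nodes computed in each bootstrap sample.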
As the bootstraps are functions of complicated estimation methods, in this case LASSO regularization of partial correlation networks based on polychoric correlation matrices, we assessed the performance of the bootstrapped difference test for both edge-weights and centrality indices in two simulation studies below. The edge-weight bootstrapped difference test performs well with Type I error rate close to the significance level (α), although the test is slightly conservative at low sample sizes (i.e., due to edge-weights often being set to zero, the test has a Type I error rate somewhat less than α). When comparing two centrality indices, the test also performs as a valid, albeit somewhat conservative, null-hypothesis test with Type I error rate close to or less than α. However, this test does feature a somewhat lower level of power in rejecting the null-hypothesis when two centralities do differ from one-another.
A null-hypothesis test, such as the bootstrapped difference test, can only be used as evidence that two values differ from one-another (and even then care should be taken in interpreting its results; e.g., Cohen 1994). Not rejecting the null-hypothesis, however, does not necessarily constitute evidence for the null-hypothesis being true (Wagenmakers 2007). The slightly lower power of the bootstrapped difference test implies that, at typical sample sizes used in psychological research, the test will tend to find fewer significant differences than actually exist at the population level. Researchers should therefore not routinely take nonsignificant differences in centrality as evidence for centralities being equal to each other, or for the centralities not being accurately estimated. Furthermore, as described below, applying a correction for multiple testing is not feasible in practice. As such, we advise care when interpreting the results of bootstrapped difference tests.
A note on multiple testing
1. The distribution of such LASSO regularized parameters is far from normal (Pötscher and Leeb 2009), and as a result approximate p-values cannot be obtained from the bootstraps. This is particularly important for extreme significance levels that might be used when one wants to test using a correction for multiple testing. It is for this reason that this paper does not mention bootstrapping p-values and only investigates null-hypothesis tests by using bootstrapped CIs.
2. When using bootstrapped CIs with N_{B} bootstrap samples, the widest interval that can be constructed is the interval between the two most extreme bootstrap values, corresponding to α = 2/N_{B}. With 1,000 bootstrap samples, this corresponds to α = 0.002. Clearly, this value is much higher than 0.000003 mentioned above. Taking the needed number of bootstrap samples for such small significance levels is computationally challenging and not feasible in practice.
3. In significance testing there is always interplay of Type I and Type II error rates: when one goes down, the other goes up. As such, reducing the Type I error rate increases the Type II error rate (not rejecting the null when the alternative hypothesis is true), and thus reduces statistical power. In the case of α = 0.000003, even if we could test at this significance level, we would likely find no significant differences due to the low statistical power.
Summary
In sum, the nonparametric (resampling rows from the data with replacement) bootstrap can be used to assess the accuracy of network estimation, by investigating the sampling variability in edge-weights, as well as to test if edge-weights and centrality indices significantly differ from one-another using the bootstrapped difference test. Case-dropping subset bootstrap (dropping rows from the data), on the other hand, can be used to assess the stability of centrality indices, that is, how well the order of centralities is retained after observing only a subset of the data. This stability can be quantified using the CS-coefficient. The R code in the Supplementary Materials shows examples of these methods on the simulated data in Figs. 1 and 2. As expected from Fig. 1, showing that the true centralities did not differ, bootstrapping reveals that none of the centrality indices in Fig. 2 significantly differ from one-another. In addition, node strength (CS(cor = 0.7) = 0.08), closeness (CS(cor = 0.7) = 0.05) and betweenness (CS(cor = 0.7) = 0.05) were far below the thresholds that we would consider stable. Thus, the novel bootstrapping methods proposed and implemented here showed that the differences in centrality indices presented in Fig. 2 were not interpretable as true differences.
Tutorial
In this section, we showcase the functionality of the bootnet package for estimating network structures and assessing their accuracy. We do so by analyzing a dataset (N = 359) of women suffering from posttraumatic stress disorder (PTSD) or subthreshold PTSD. The bootnet package includes the bootstrapping methods, CS-coefficient and bootstrapped difference tests as described above. In addition, bootnet offers a wide range of plotting methods. After estimating nonparametric bootstraps, bootnet produces plots that show the bootstrapped CIs of edge-weights or which edges and centrality indices significantly differ from one-another. After estimating subset bootstrap, bootnet produces plots that show the correlation of centrality indices under different levels of subsetting (Costenbader and Valente 2003). In addition to the correlation plot, bootnet can be used to plot the average estimated centrality index for each node under different sampling levels, giving more detail on the order of centrality under different subsetting levels.
With bootnet, users can not only perform accuracy and stability tests, but also flexibly estimate a wide variety of network models in R. The estimation technique can be specified as a chain of R commands, taking the data as input and returning a network as output. In bootnet, this chain is broken in several phases: data preparation (e.g., correlating or binarizing), model estimation (e.g., glasso) and network selection. The bootnet package has several default sets, which can be assigned using the default argument in several functions. These default sets can be used to easily specify the most commonly used network estimation procedures. Table 1 gives an overview of the default sets and the corresponding R functions called.^{6}
R chains to estimate network models from data. The default sets "EBICglasso", "pcor", "huge" and "adalasso" estimate a Gaussian graphical model and the default sets "IsingFit" and "IsingLL" estimate the Ising model. The notation package::function indicates that the function after the colons comes from the package before the colons. Chains are schematically represented using magrittr chains: Whatever is on the left of %>% is used as first argument to the function on the right of this operator. Thus, the first chain corresponding to "EBICglasso" can also be read as qgraph::EBICglasso(qgraph::cor_auto(Data))
Default set  R chain

EBICglasso  Data %>% qgraph::cor_auto %>% qgraph::EBICglasso
pcor  Data %>% qgraph::cor_auto %>% corpcor::cor2pcor
IsingFit  Data %>% bootnet::binarize %>% IsingFit::IsingFit
IsingLL  Data %>% bootnet::binarize %>% IsingSampler::EstimateIsing(method = "ll")
huge  Data %>% as.matrix %>% na.omit %>% huge::huge.npn %>% huge::huge(method = "glasso") %>% huge::huge.select(criterion = "ebic")
adalasso  Data %>% parcor::adalasso.net
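To make the chain notation concrete, the following is a minimal base-R sketch of the two steps behind the "pcor" default set: correlating the data and converting correlations to partial correlations. It is an illustration under stated assumptions rather than the bootnet internals: the simulated Data matrix and the object name pcorNetwork are hypothetical, cor stands in for qgraph::cor_auto, and the matrix inversion stands in for corpcor::cor2pcor.

```r
# Hypothetical toy data: 200 cases, 4 continuous variables (not the PTSD data).
set.seed(1)
Data <- matrix(rnorm(200 * 4), ncol = 4)
Data[, 2] <- Data[, 2] + 0.5 * Data[, 1]  # induce one association (1 -- 2)

# Step 1 (data preparation): correlate the variables.
# cor() is a stand-in for qgraph::cor_auto, which also handles ordinal data.
R <- cor(Data)

# Step 2 (model estimation): convert correlations to partial correlations.
# Standardizing the inverse correlation matrix and flipping the sign gives
# the partial correlation of each pair given all other variables.
K <- solve(R)
pcorNetwork <- -cov2cor(K)
diag(pcorNetwork) <- 0  # no self-loops in the weights matrix

round(pcorNetwork, 2)
```

The resulting weights matrix is unregularized; the "EBICglasso" default set would instead shrink small partial correlations to exactly zero.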
Example: posttraumatic stress disorder
To exemplify the usage of bootnet in both estimating and investigating network structures, we use a dataset of 359 women enrolled in community-based substance abuse treatment programs across the United States (study title: Women’s Treatment for Trauma and Substance Use Disorders; study number: NIDA-CTN-0015).^{7} All participants met the criteria for either PTSD or subthreshold PTSD, according to the DSM-IV-TR (American Psychiatric Association 2000). Details of the sample, such as inclusion and exclusion criteria as well as demographic variables, can be found elsewhere (Hien et al., 2009). We estimate the network using the 17 PTSD symptoms from the PTSD Symptom Scale-Self Report (PSS-SR; Foa et al., 1993). Participants rated the frequency of endorsing these symptoms on a scale ranging from 0 (not at all) to 3 (at least 4 or 5 times a week).
Network estimation
Node IDs and corresponding symptom names of the 17 PTSD symptoms
ID  Variable 

1  Avoid reminders of the trauma
2  Bad dreams about the trauma 
3  Being jumpy or easily startled 
4  Being over alert 
5  Distant or cut off from people 
6  Feeling emotionally numb 
7  Feeling irritable 
8  Feeling plans won’t come true 
9  Having trouble concentrating 
10  Having trouble sleeping 
11  Less interest in activities 
12  Not able to remember 
13  Not thinking about trauma 
14  Physical reactions 
15  Reliving the trauma 
16  Upset when reminded of trauma 
17  Upsetting thoughts or images 
Computing centrality indices
To investigate centrality indices in the network, we can use the centralityPlot function from the qgraph package:
    library("qgraph")
    centralityPlot(Network)
The resulting plot is shown in Fig. 3 (right panel). It can be seen that nodes differ quite substantially in their centrality estimates. In the network, Node 17 (upsetting thoughts/images) has the highest strength and betweenness and Node 3 (being jumpy) has the highest closeness. However, without knowing the accuracy of the network structure and the stability of the centrality estimates, we cannot conclude whether the differences of centrality estimates are interpretable or not.
Edge-weight accuracy
The bootnet function can be used to perform the bootstrapping methods described above. The function can be used in the same way as the estimateNetwork function, or can take the output of the estimateNetwork function to run the bootstrap using the same arguments. By default, the nonparametric bootstrap with 1,000 samples will be used. This can be overridden using the nBoots argument, which is used below to obtain smoother plots.^{8} The nCores argument can be used to speed up bootstrapping by using multiple computer cores (here, eight cores are used):
    boot1 <- bootnet(Network, nBoots = 2500, nCores = 8)
The print method of this object gives an overview of characteristics of the sample network (e.g., the number of estimated edges) and tips for further investigation, such as how to plot the estimated sample network or any of the bootstrapped networks. The summary method can be used to create a summary table of certain statistics containing quantiles of the bootstraps.
The plot method can be used to show the bootstrapped CIs for estimated edge parameters:
    plot(boot1, labels = FALSE, order = "sample")
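The logic behind these bootstrapped CIs can be sketched in base R: resample cases with replacement, re-estimate the network, and take quantiles of each edge-weight across the bootstrap samples. The sketch below is a simplified illustration, not the bootnet implementation. It uses toy data and unregularized partial correlations instead of the regularized network, and the helper name pcorNet is hypothetical.

```r
# Hypothetical helper: unregularized partial correlation network.
pcorNet <- function(x) {
  P <- -cov2cor(solve(cor(x)))
  diag(P) <- 0
  P
}

set.seed(123)
n <- 300
Data <- matrix(rnorm(n * 5), ncol = 5)     # toy data, not the PTSD sample
Data[, 2] <- Data[, 2] + 0.6 * Data[, 1]   # one true edge: 1 -- 2

nBoots <- 500
edges <- which(upper.tri(diag(5)), arr.ind = TRUE)  # all 10 node pairs

# Resample cases with replacement and re-estimate the network each time.
# The result is a 10 x nBoots matrix of bootstrapped edge-weights.
bootEdges <- replicate(nBoots, {
  idx <- sample(n, replace = TRUE)
  pcorNet(Data[idx, ])[edges]
})

# 95% bootstrapped CI per edge: 2.5% and 97.5% quantiles across bootstraps.
CI <- t(apply(bootEdges, 1, quantile, probs = c(0.025, 0.975)))
result <- data.frame(node1 = edges[, 1], node2 = edges[, 2],
                     sample = pcorNet(Data)[edges],
                     lower = CI[, 1], upper = CI[, 2])

# Order edges by their sample value, as order = "sample" does in the plot.
result[order(-result$sample), ]
```

Wide intervals in such a table indicate that the corresponding edge-weights, and any ordering based on them, should be interpreted with care.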
Centrality stability
We can now investigate the stability of centrality indices by estimating network models based on subsets of the data. The case-dropping bootstrap is requested with type = "case":
    boot2 <- bootnet(Network, nBoots = 2500, type = "case", nCores = 8)
To plot the stability of centrality under subsetting, the plot method can again be used:
    plot(boot2)
The CS-coefficient indicates that betweenness (CS(cor = 0.7) = 0.05) and closeness (CS(cor = 0.7) = 0.05) are not stable under subsetting cases. Node strength performs better (CS(cor = 0.7) = 0.44), but does not reach the cutoff of 0.5 that our simulation study suggests is required to consider the metric stable. Therefore, we conclude that the order of node strength is interpretable with some care, while the orders of betweenness and closeness are not.
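As a rough illustration of what the CS-coefficient summarizes, the sketch below drops increasing proportions of cases, recomputes node strength, and records how often the correlation with the full-sample strength is at least 0.7. CS(cor = 0.7) is then taken as the largest drop proportion at which this happens in at least 95% of the subsets. This is a simplified sketch under assumptions: it uses toy data, unregularized partial correlations, a coarse grid of drop proportions, and hypothetical helper names (pcorNet, strength); the bootnet routine may differ in its details.

```r
# Hypothetical helpers: partial correlation network and node strength.
pcorNet <- function(x) {
  P <- -cov2cor(solve(cor(x)))
  diag(P) <- 0
  P
}
strength <- function(P) colSums(abs(P))

set.seed(42)
n <- 400
p <- 6
Data <- matrix(rnorm(n * p), ncol = p)     # toy data, not the PTSD sample
Data[, 2] <- Data[, 2] + 0.8 * Data[, 1]
Data[, 3] <- Data[, 3] + 0.5 * Data[, 2]
original <- strength(pcorNet(Data))

dropProps <- seq(0.1, 0.7, by = 0.1)  # proportions of cases to drop
nSub <- 200                           # subsets drawn per proportion

# For each drop proportion: the share of subsets in which node strength
# correlates at least 0.7 with node strength in the full sample.
propAbove <- sapply(dropProps, function(d) {
  cors <- replicate(nSub, {
    keep <- sample(n, size = round(n * (1 - d)))  # sample without replacement
    cor(original, strength(pcorNet(Data[keep, ])))
  })
  mean(cors >= 0.7)
})

# CS(cor = 0.7): largest drop proportion with at least 95% of subsets >= 0.7.
ok <- dropProps[propAbove >= 0.95]
CS <- if (length(ok) > 0) max(ok) else 0
CS
```

A value near zero means that dropping even a small share of cases can change the centrality order substantially.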
Testing for significant differences
The differenceTest function can be used to compare edge-weights and centralities using the bootstrapped difference test. This makes use of the nonparametric bootstrap results (here named boot1) rather than the case-dropping bootstrap results. For example, the following code tests if Node 3 and Node 17 differ in node strength centrality:
    differenceTest(boot1, 3, 17, "strength")
The results show that these nodes do not differ in node strength since the bootstrapped CI includes zero (CI: −0.20, 0.35). The plot method can be used to plot the difference tests between all pairs of edges and centrality indices. For example, the following code plots the difference tests between all pairs of edge-weights:
    plot(boot1, "edge", plot = "difference", onlyNonZero = TRUE, order = "sample")
Here, the plot argument has to be used because the function normally defaults to plotting bootstrapped CIs for edge-weights, the onlyNonZero argument ensures that only edges that are nonzero in the estimated network are shown, and order = "sample" orders the edge-weights from the most positive to the most negative edge-weight in the sample network. We can use similar code for comparing node strength:
    plot(boot1, "strength")
Here, we did not have to specify the plot argument, as it is set to "difference" by default when the statistic is a centrality index.
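The idea of the bootstrapped difference test can be sketched in base R: compute the difference between two statistics in every nonparametric bootstrap sample, form a bootstrapped CI around that difference, and check whether zero lies inside it. The sketch below is illustrative only; it uses toy data and unregularized partial correlations, and the function name strengthDifference and the helpers pcorNet and strength are hypothetical, not part of bootnet.

```r
# Hypothetical helpers: partial correlation network and node strength.
pcorNet <- function(x) {
  P <- -cov2cor(solve(cor(x)))
  diag(P) <- 0
  P
}
strength <- function(P) colSums(abs(P))

set.seed(7)
n <- 300
p <- 5
Data <- matrix(rnorm(n * p), ncol = p)     # toy data, not the PTSD sample
Data[, 2] <- Data[, 2] + 0.8 * Data[, 1]   # nodes 1 and 2 share a strong edge

# Nonparametric bootstrap of node strength: a p x nBoots matrix.
nBoots <- 1000
bootStrength <- replicate(nBoots, {
  strength(pcorNet(Data[sample(n, replace = TRUE), ]))
})

# Bootstrapped CI of the strength difference between nodes i and j.
# The difference is flagged as significant when the CI excludes zero.
strengthDifference <- function(i, j, alpha = 0.05) {
  d <- bootStrength[i, ] - bootStrength[j, ]
  ci <- quantile(d, probs = c(alpha / 2, 1 - alpha / 2))
  list(lower = ci[[1]], upper = ci[[2]],
       significant = ci[[1]] > 0 || ci[[2]] < 0)
}

strengthDifference(1, 4)  # node 1 has a strong edge, node 4 has none
strengthDifference(4, 5)  # two nodes without true edges
```

As in the tutorial, this tests one pair at a time and does not correct for multiple testing.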
Simulation studies
We conducted three simulation studies to assess the performance of the methods described above. In particular, we investigated the performance of (1) the CS-coefficient and the bootstrapped difference test for (2) edge-weights and (3) centrality indices. All simulation studies use networks of 10 nodes. The networks were used as partial correlation matrices to generate multivariate normal data, which were subsequently made ordinal with four levels by drawing random thresholds; we did so because most prior network papers estimated networks on ordinal data (e.g., psychopathological symptom data). We varied sample size between 100, 250, 500, 1,000, 2,500 and 5,000, and replicated every condition 1,000 times. We estimated Gaussian graphical models, using the graphical LASSO in combination with EBIC model selection (Epskamp and Fried 2016; Foygel and Drton 2010), using polychoric correlation matrices as input. Each bootstrap method used 1,000 bootstrap samples. In addition, we replicated every simulation study with 5-node and 20-node networks, which showed similar results and were thus not included in this paper to improve clarity.
CS-coefficients
Edge-weight bootstrapped difference test
Centrality bootstrapped difference test
Discussion
In this paper, we have summarized the state-of-the-art in psychometric network modeling, provided a rationale for investigating how susceptible estimated psychological networks are to sampling variation, and described several methods that can be applied after estimating a network structure to check the accuracy and stability of the results. We proposed to perform these checks in three steps: (A) assess the accuracy of estimated edge-weights, (B) assess the stability of centrality indices after subsetting the data, and (C) test if edge-weights and centralities differ from one another. Bootstrapping procedures can be used to perform these steps. While bootstrapping edge-weights is straightforward, we also introduced two new statistical methods: the correlation stability coefficient (CS-coefficient) and the bootstrapped difference test for edge-weights and centrality indices, to aid in steps (B) and (C), respectively. To help researchers conduct these analyses, we have developed the freely available R package bootnet, which acts as a generalized framework for estimating network models and performing the accuracy tests outlined in this paper. It is of note that, while we demonstrate the functionality of bootnet in this tutorial using a Gaussian graphical model, the package can be used for any estimation technique in R that estimates an undirected network (such as the Ising model with binary variables).
Empirical example results
The accuracy analysis of a 17-node symptom network of 359 women with (subthreshold) PTSD showed a network that was susceptible to sampling variation. First, the bootstrapped confidence intervals of the majority of edge-weights were large. Second, we assessed the stability of centrality indices under dropping people from the dataset, which showed that only node strength centrality was moderately stable; betweenness and closeness centrality were not. This means that the order of node strength centrality was somewhat interpretable, although such interpretation should be done with care. Finally, bootstrapped difference tests at a significance level of 0.05 indicated that significant differences between the centralities of nodes could be detected only for node strength, and that only three edge-weights were significantly higher than most other edges in the network.
Limitations and future directions
Power-analysis in psychological networks
Overall, we see that networks are estimated more accurately as sample size increases. This makes it easier to detect differences between centrality estimates, and also increases the stability of the order of centrality estimates. But how many observations are needed to estimate a reasonably stable network? This important question, usually referred to as power-analysis in other fields of statistics (Cohen 1977), is largely unanswered for psychological networks. When a reasonable prior guess of the network structure is available, a researcher might opt to use the parametric bootstrap, which has also been implemented in bootnet, to investigate the expected accuracy of edge-weights and centrality indices under different sample sizes. However, as the field of psychological networks is still young, such guesses are currently hard to come by. As more network research is done in psychology, more knowledge will become available on the graph structures and edge-weights that can be expected in various fields of psychology. As such, power calculations are a topic for future research and are beyond the scope of the current paper.
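The kind of simulation such a prior guess would enable can be sketched in base R: assume a "true" partial correlation network, generate multivariate normal data from it at several sample sizes, and track how well the estimated edge-weights recover the true ones. The sketch below is a hypothetical illustration and does not call bootnet. The chain-shaped network with weights of 0.3, the helper names pcorNet and simulateData, and the use of unregularized partial correlations are all assumptions made for the example.

```r
# Hypothetical helper: unregularized partial correlation network.
pcorNet <- function(x) {
  P <- -cov2cor(solve(cor(x)))
  diag(P) <- 0
  P
}

set.seed(2017)
p <- 5

# Assumed "true" network: a chain 1-2-3-4-5 with partial correlations of 0.3.
trueNet <- matrix(0, p, p)
for (i in 1:(p - 1)) trueNet[i, i + 1] <- trueNet[i + 1, i] <- 0.3

# Implied correlation matrix: the precision matrix has a unit diagonal and
# off-diagonal elements equal to minus the partial correlations.
K <- diag(p) - trueNet
Sigma <- cov2cor(solve(K))

# Draw n multivariate normal cases with correlation matrix Sigma.
simulateData <- function(n) {
  Z <- matrix(rnorm(n * p), ncol = p)
  Z %*% chol(Sigma)
}

edges <- upper.tri(trueNet)
sampleSizes <- c(100, 250, 500, 1000, 2500)
nRep <- 100

# Average correlation between estimated and true edge-weights per sample size.
accuracy <- sapply(sampleSizes, function(n) {
  mean(replicate(nRep, cor(pcorNet(simulateData(n))[edges], trueNet[edges])))
})
names(accuracy) <- sampleSizes
round(accuracy, 2)
```

The same loop could track a centrality index instead of edge-weights, which would give a rough impression of the sample size needed for a stable centrality order under the assumed structure.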
Future directions
While working on this project, two new research questions emerged: is it possible to form an unbiased estimator for centrality indices in partial correlation networks, and consequently, how should true 95% confidence intervals around centrality indices be constructed? As our example highlighted, centrality indices can be highly unstable due to sampling variation, and the estimated sampling distribution of centrality indices can be severely biased. At present, we have no definite answer to these pressing questions, which we discuss in some more detail in the Supplementary Materials. In addition, constructing bootstrapped CIs at very low significance levels is not feasible with a limited number of bootstrap samples, and approximating p-values is problematic, especially for networks estimated using regularization. As a result, performing difference tests while controlling for multiple testing is still a topic of future research. Given the current emergence of network modeling in psychology, addressing these questions should have high priority.
Related research questions
We only focused on accuracy analysis of cross-sectional network models. Assessing variability in longitudinal and multilevel models is more complicated and beyond the scope of the current paper; it is also not implemented in bootnet as of yet. We refer the reader to Bringmann et al. (2015) for a demonstration of how confidence intervals can be obtained in a longitudinal multilevel setting. We also want to point out that the results obtained here may be idiosyncratic to the particular data used. In addition, it is important to note that the bootstrapped edge-weights should not be used as a method for comparing networks based on different groups (e.g., comparing the bootstrapped CI of an edge in one network to the bootstrapped CI of the same edge in another network), for which a statistical test is being developed.^{10} Finally, we wish to point out promising research on obtaining exact p-values and confidence intervals based on the results of LASSO regularized analyses (see Hastie et al. (2015) for an overview), which may in the future reduce the need to rely on bootstrapping methods.
Conclusions
In addition to providing a framework for network estimation and performing the accuracy tests proposed in this paper, bootnet offers further functionality for checking the accuracy and stability of results that was beyond the scope of this paper, such as the parametric bootstrap, the node-dropping bootstrap (Costenbader and Valente 2003) and plots of centrality indices of each node under different levels of subsetting. Future development of bootnet will be aimed at implementing functionality for a broader range of network models, and we encourage readers to submit any such ideas or feedback to the Github Repository.^{11} Network accuracy has been a blind spot in psychological network analysis, and the authors are aware of only one prior paper that has examined network accuracy (Fried et al. 2016), which used an earlier version of bootnet than the version described here. Further remediating the blind spot of network accuracy is of utmost importance if network analysis is to be added as a full-fledged methodology to the toolbox of the psychological researcher.
An introduction on the interpretation and inference of network models has been included in the Supplementary Materials.
Penalized maximum likelihood estimation used in this analysis typically leads to slightly lower parameter estimates on average. As a result, the absolute edge-weights in Fig. 2 are all closer to zero than the absolute edge-weights in the true network in Fig. 1.
CRAN link: http://cran.r-project.org/package=bootnet Github link (developmental): http://www.github.com/SachaEpskamp/bootnet
While the GGM requires a covariance matrix as input, it is important to note that the model itself is based on the (possibly sparse) inverse of the covariance matrix. Therefore, the network shown does not show marginal correlations (regular correlation coefficients between two variables). The inverse covariance matrix instead encodes partial correlations.
One might instead only test for difference in edges that were estimated to be nonzero with the LASSO. However, doing so still often leads to a large number of tests.
Using many bootstrap samples, such as the 2,500 used here, might result in memory problems or long computation time. It is advisable to first use a small number of samples (e.g., 10) and then try more. The simulations below show that 1,000 samples may often be sufficient.
As with any CI, nonoverlapping CIs indicate two statistics significantly differ at the given significance level. The reverse is not true; statistics with overlapping CIs might still significantly differ.
Supplementary material
Funding information
Funder Name  Grant Number  Funding Note 

NWO "research talent" 

Copyright information
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.