Introduction

In recent years, the availability of high-dimensional biological data has rapidly outpaced the tools available to understand it. Here, we describe an algorithm for analyzing such data using an embedded feature selection method to infer networks of mechanistic biological relationships. These networks summarize the totality of the information available in the data, and can be used to help answer many of the critical questions across drug discovery and development: which biomarkers and targets to select, what differences in biology exist between pre-clinical species, what a drug’s mechanism of action is, and how to identify the best combinations of treatments.

The best performing network inference methods have a significant liability, however, which we address with our new approach. These methods, such as GENIE3 (the best performing method in the ‘Dialogue for Reverse Engineering Assessments and Methods’ initiative’s challenges), do not explicitly infer networks but rather rank all possible relationships. We refer to these methods as ‘weight-focused.’ To define a network from the ranked lists produced by these weight-focused methods, the user must subjectively choose a threshold, excluding some number of the lowest-ranked edges to define a final network. A common approach in the literature is to take the top million relationships, which, depending on the number of features in the dataset, represents anywhere from the top 0.2% to 2% of ranked relationships [8, 14, 27, 34]. The basis for selecting these threshold values has not been well established or evaluated in benchmarking exercises such as DREAM, yet has a dramatic effect on the outcome of the analysis. A high threshold risks omitting informative relationships in the data; a low threshold risks including so many errors that meaningful interpretation becomes impossible; and little to no information exists defining what a ‘high’ or ‘low’ threshold is for any given data type or application. Our method uses a different approach to feature selection and network inference to avoid this hand-tuned thresholding step, instead identifying the subset of relationships that forms the network directly.
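To make this thresholding step concrete, here is a minimal Python sketch; the names and the random weight matrix are hypothetical stand-ins for a weight-focused method’s output:

```python
import numpy as np

# Hypothetical stand-in for a weight-focused method's output: one weight
# per possible directed edge among n_features features.
rng = np.random.default_rng(0)
n_features = 1000
weights = rng.random((n_features, n_features))
np.fill_diagonal(weights, 0.0)  # no self-edges

# The user must choose what fraction of top-ranked edges to keep;
# literature values range from roughly the top 0.2% to the top 2%.
keep_fraction = 0.01  # a subjective choice with no principled default
cutoff = np.quantile(weights[weights > 0], 1.0 - keep_fraction)
adjacency = weights >= cutoff  # the final network depends entirely on this choice
```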

The network inference algorithm we describe here follows the iterative feature selection approach used by other high-performing network inference methods [10, 16], which is a natural extension of the general feature selection approach often used to analyze high-dimensional data. All feature selection approaches answer a common question: which features are meaningful predictors of a specified response? This is particularly critical for high-dimensional data, such as multi-omics datasets which combine measurements from many biological domains and analytical technologies (’omes). These datasets commonly measure thousands to hundreds of thousands of independent biological features, introducing the ‘curse of dimensionality’ [18] and violating the assumption, central to many common statistical approaches [17], that there are many more samples than measured quantities. Feature selection enables interpretation of high-dimensional data by identifying the subset of features most useful for predicting the response (predictors), thereby reducing the dimensions of the data to a scale that can be handled more easily.

Network inference can be done by reapplying this general approach of feature selection targeting each feature of the dataset separately as a response to be predicted. Through this process not only are the meaningful predictors of a response identified, but the meaningful predictors of meaningful predictors, and so on. The results of this iterative feature selection process are visualized as a network, with a predictive relationship between a selected feature and response represented as a connection (edge) between features (vertices). The network approach thus provides the same answers as would a general feature selection approach focused on a specific response of interest, while also providing additional information critical to understanding the wider biological context.

To illustrate the utility that network inference adds, consider an analysis in which a single protein is found to predict an important clinical endpoint. This information alone suggests the protein might be a good target to modulate the clinical endpoint. However, this analysis does not describe what biological features may be acting upon the prospective protein target. Homeostatic control mechanisms are a central feature of biology, and without identifying and inhibiting these control mechanisms the single protein target might be impossible to influence. If a network inference approach is used instead, all the upstream control mechanisms can be considered when making the choice of target or biomarker, avoiding costly mistakes due to unknown biology. This wider context also provides additional choices of target or biomarker, which may be easier to drug or measure, or may identify a combination of targets necessary to achieve the best efficacy. This approach can also resolve challenges in translation by inferring networks for different species and comparing differences in biology. Fundamentally, network inference enables a holistic view of the wealth of information in high-dimensional data to provide a deeper understanding of biology.

Though network inference is a promising approach in principle, realizing this promise in practice depends upon the methods used. Reviews of benchmarking challenges using real and realistic simulated data [20] have shown that iterative feature selection methods outperform the alternatives, excluding methods that require time-series or knock-up/knock-down experiment data unavailable in clinical settings. One reason iterative feature selection methods are better able to identify true relationships is that, in comparison to simpler pair-wise or univariate methods such as correlation or mutual information scores, they model the multivariate effect of all predictors on a given response and thus capture interactions between predictors. Among iterative feature selection methods, GENIE3, a random forest based method [16], has shown the best individual performance overall, particularly when tested on real data. This high performance can be understood from the strengths of random forest [3] as a tool for feature selection: it is able to identify nonlinear relationships and does not make assumptions about the distribution of the data.

The method we propose here is intended to retain the strengths shown by the GENIE3 algorithm (iterative feature selection with a nonlinear, nonparametric method) while addressing the threshold-dependence of these methods, as discussed above. It does so by using a bespoke machine learning method for nonlinear, nonparametric feature selection [31] to identify sparse subsets of meaningful features. Rather than ranking all features as candidate predictors for a given response, this method excludes any features which do not improve prediction of the response. As a result, the method identifies a network with no additional need for thresholding. Additionally, we use this ‘subset-focused’ method in parallel with the weight-focused GENIE3 method to both identify a specific network from the data and accurately rank relationships within the network based on confidence. We name our implementation of the feature selection approach ‘DEKER’ for decomposed kernel regression, the core principle of the original method, and the network inference approach ‘DEKER-NET.’ We describe the methodology briefly in ‘Methods’ and in complete detail in the ‘Appendix’, and illustrate the performance of DEKER-NET on the benchmarking dataset from the DREAM4 challenge [7] in ‘Results’. A C++ library implementation of the complete method is available at github.com/Merck/deker.

Methods

DEKER-NET incorporates several novel improvements to enable subset-focused network inference. The core of the methodology is the feature selection method originally described by Sun et al. [31], which is summarized in ‘Nonlinear feature selection’, below. To best leverage this method for network inference, we implement a version with a novel strategy for selecting hyperparameters, DEKER, described in the ‘Appendix’. Additionally, when assembling the features inferred by DEKER into a network in DEKER-NET, we use GENIE3’s approach to weight edges. The complementarity between DEKER and GENIE3 improves on both the subset selection of DEKER alone and the random forest edge weighting of GENIE3 alone. This process is described in ‘Network inference using DEKER-NET’, with a complete algorithm in the ‘Appendix’.

Nonlinear feature selection

The feature selection method originally described by Sun et al. [31] was chosen for DEKER-NET based on the characteristics of high-performing methods in the DREAM4 challenge [20]. Specifically, the method models nonlinear, multivariate relationships without assumptions about the underlying distribution of data. Additionally, the method selects a sparse subset of features, which as discussed above differentiates it from the weight-focused methods used in other network inference approaches.

Sun et al.’s feature selection method is based on the principles of kernel regression, which uses the nearest neighbors of each sample or observation to predict its response. The influence of each neighbor and the size of the neighborhood are defined by a kernel function. To illustrate, take a dataset \(\mathbb {D} = \{(\mathbf {x}_i,y_i)\}_{i=1}^{N_s}\), with paired samples \(\mathbf {x}_i\) and responses \(y_i\), where \(N_s\) is the number of samples. Each sample \(\mathbf {x}_i\) is a vector of length \(N_j\), the number of features, where \(\mathbf {x}_i \in \mathbb {R}^{N_j}\) and \(y_i \in \mathbb {R}\). A held-out response \(y_*\) is predicted from its sample \(\mathbf {x}_*\) and the other samples and responses in the dataset using the kernel function f.

$$\begin{aligned}&\hat{y}_* = \frac{\sum ^{N_s}_{i = 1} f(\mathbf {x}_*,\mathbf {x}_i) y_i}{\sum ^{N_s}_{i = 1} f(\mathbf {x}_*,\mathbf {x}_i)} \end{aligned}$$
(1)

A common choice of kernel function f is the squared exponential kernel, which introduces a characteristic lengthscale hyperparameter \(l_f\) that controls the size of the neighborhood used for predicting each sample.

$$\begin{aligned} f(\mathbf {x}_*,\mathbf {x}_i) = \mathrm {exp}\left(- \frac{\Vert \mathbf {x}_*-\mathbf {x}_i \Vert _2^2}{2 l_f^2}\right) \end{aligned}$$
(2)
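As a concrete illustration of Eqs. (1) and (2), here is a minimal NumPy sketch; the function names are ours and are not taken from the DEKER library:

```python
import numpy as np

def sq_exp_kernel(x_star, x_i, lengthscale):
    # Squared exponential kernel (Eq. 2): samples within roughly one
    # lengthscale of x_star dominate the prediction.
    return np.exp(-np.sum((x_star - x_i) ** 2) / (2.0 * lengthscale ** 2))

def kernel_regression_predict(x_star, X, y, lengthscale):
    # Kernel regression prediction (Eq. 1): the kernel-weighted average
    # of the observed responses.
    k = np.array([sq_exp_kernel(x_star, x_i, lengthscale) for x_i in X])
    return np.sum(k * y) / np.sum(k)
```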

Kernel regression can be used for feature selection by adding weights \(\mathbf {w} \in \mathbb {R}^{N_j}\) which scale each feature \(x_{i,j}\), adjusting the relative contribution of each feature to the prediction or removing a feature entirely when \(w_j = 0\),

$$\begin{aligned} \hat{y}_* = \frac{\sum ^{N_s}_{i = 1} f(\mathbf {w} \circ \mathbf {x}_*,\mathbf {w} \circ \mathbf {x}_i) y_i}{\sum ^{N_s}_{i = 1} f(\mathbf {w} \circ \mathbf {x}_*,\mathbf {w} \circ \mathbf {x}_i)} \end{aligned}$$
(3)

where \(\circ\) is the Hadamard operator for element-wise multiplication between vectors. With the weights introduced, the feature selection problem can be stated as finding the weights that minimize the prediction error between the true held-out response \(y_*\) and the prediction \(\hat{y}_*\). For example, the RGS algorithm [24], a specific implementation of kernel regression for feature selection, uses the sum of squared errors for prediction error and holds out each response in the dataset in turn (leave-one-out cross-validation, LOOCV).

$$\begin{aligned} \begin{aligned} \min _{\mathbf {w}} \quad&\sum ^{N_s}_{j = 1} \frac{1}{2}\left( y_j - \frac{\sum _{i \in \{1 \dots N_s\} \setminus \{j\}} f(\mathbf {w} \circ \mathbf {x}_i,\mathbf {w} \circ \mathbf {x}_j) y_i}{\sum _{i \in \{1 \dots N_s\} \setminus \{j\}} f(\mathbf {w} \circ \mathbf {x}_i,\mathbf {w} \circ \mathbf {x}_j)}\right) ^2 \\ \text {s.t.} \quad&\mathbf {w} \ge 0 \end{aligned} \end{aligned}$$
(4)
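A sketch of the LOOCV objective of Eq. (4), under the same illustrative conventions as above:

```python
import numpy as np

def loocv_objective(w, X, y, lengthscale):
    # Sum-of-squares LOOCV error of Eq. (4): each response is predicted
    # from all other samples after scaling every feature by the weights w.
    Xw = X * w  # Hadamard scaling of the features (Eq. 3)
    n = len(y)
    total = 0.0
    for j in range(n):
        others = np.arange(n) != j  # leave sample j out
        d2 = np.sum((Xw[others] - Xw[j]) ** 2, axis=1)
        k = np.exp(-d2 / (2.0 * lengthscale ** 2))  # squared exponential kernel
        y_hat = np.sum(k * y[others]) / np.sum(k)
        total += 0.5 * (y[j] - y_hat) ** 2
    return total
```

Minimizing this function over \(\mathbf {w}\) directly is the step that, as discussed next, proves numerically difficult.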

Sun et al. observe, however, that this formulation of the feature selection problem is very difficult to solve numerically: the minimization problem has many local minima, and as a result the ideal performance of the algorithm cannot be realized.

As an alternative, Sun et al. decompose the kernel regression problem into a series of classification problems with a quasi-convex objective function which can be reliably solved for the global minimum. The core idea is that any regression problem can be re-interpreted as a joint set of classification problems. Specifically, a set of thresholds \(\mathbf {t} \in \mathbb {R}^{N_t}\) is defined to divide the samples of the dataset \(\mathbb {D}\) into two sets. Sun et al. use every response in the dataset as a threshold, as will we, though it is worth noting that fewer thresholds (e.g., every other response, every third, etc.) could be used to speed computation at the cost of some performance. For a given threshold value \(t_k\), samples are divided into subsets \(\mathbb {T}_{< t_k}\) and \(\mathbb {T}_{\ge t_k}\) based on their responses \(y_i\).

$$\begin{aligned} \begin{aligned} \mathbb {T}_{< t_k}&= \{i \vert y_i < t_k, i \in \{1 \dots N_s \} \} \\ \mathbb {T}_{\ge t_k}&= \{ i \vert y_i \ge t_k, i \in \{1 \dots N_s \} \} \end{aligned} \end{aligned}$$
(5)

Samples can then be classified in the same way response values are predicted in kernel regression: using the kernel-weighted distance of each sample’s predictors to the predictors in each subset. The performance of the classifier is then assessed by the difference between the kernel-weighted distance to the correct subset and to the incorrect subset, given the sample’s true response; this difference is called the margin m.

$$\begin{aligned} \begin{aligned} m(i,k)&= y^c_{i,k} \left( \sum _{j \in \mathbb {T}_{< t_k}} \frac{f(\mathbf {w} \circ \mathbf {x}_i,\mathbf {w} \circ \mathbf {x}_j)}{\sum _{j' \in \mathbb {T}_{< t_k}} f(\mathbf {w} \circ \mathbf {x}_i,\mathbf {w} \circ \mathbf {x}_{j'})} \Vert \mathbf {w} \circ (\mathbf {x}_i-\mathbf {x}_j) \Vert _1 \right. \\&\qquad \left. - \sum _{j \in \mathbb {T}_{\ge t_k}} \frac{f(\mathbf {w} \circ \mathbf {x}_i,\mathbf {w} \circ \mathbf {x}_j)}{\sum _{j' \in \mathbb {T}_{\ge t_k}} f(\mathbf {w} \circ \mathbf {x}_i,\mathbf {w} \circ \mathbf {x}_{j'})} \Vert \mathbf {w} \circ (\mathbf {x}_i-\mathbf {x}_j) \Vert _1 \right) \\ y^{c}_{i,k}&= {\left\{ \begin{array}{ll} -1 &{} y_i < t_k \\ 1 &{} y_i \ge t_k \end{array}\right. } \end{aligned} \end{aligned}$$
(6)

This formulation can be used to construct a minimization problem equivalent to Eq. (4). Sun et al. specifically use L1 distance to simplify the calculation of derivatives for minimization. Additionally, Sun et al. include a penalty on the feature weights \(\mathbf {w}\), as is done in LASSO regression [32], to impose sparsity. This penalty ensures that only a subset of predictive features with unique predictive power is selected, excluding uninformative features as well as features that are correlated with true predictors but not uniquely informative. The penalty introduces another hyperparameter, \(\lambda\), which controls its strength. The resulting objective function to be minimized is

$$\begin{aligned} \min _{\mathbf {w}} \quad&\sum ^{N_s}_{i=1} \sum _{k=1}^{N_t} m(i,k) + \lambda \sum ^{N_j}_{j=1} w_j \\ \text {s.t.} \quad&\mathbf {w} \ge 0 \end{aligned}$$
(7)
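To make Eqs. (5)–(7) concrete, the following schematic sketch evaluates the penalized objective for a candidate weight vector. It is a simplified illustration rather than DEKER’s implementation: the lengthscale is fixed to 1, empty threshold subsets are skipped, and each sample is excluded from its own subset.

```python
import numpy as np

def margin(i, t_k, Xw, y):
    # Margin of Eq. (6): difference of kernel-weighted expected L1
    # distances to the below-threshold and at-or-above-threshold subsets.
    def expected_distance(subset):
        subset = subset[subset != i]  # simplification: exclude the sample itself
        if subset.size == 0:
            return 0.0  # simplification: skip empty subsets
        l1 = np.abs(Xw[subset] - Xw[i]).sum(axis=1)  # ||w o (x_i - x_j)||_1
        k = np.exp(-l1 / 2.0)  # kernel weights, lengthscale fixed to 1
        return np.sum(k / k.sum() * l1)
    below = np.flatnonzero(y < t_k)
    above = np.flatnonzero(y >= t_k)
    y_c = -1.0 if y[i] < t_k else 1.0
    return y_c * (expected_distance(below) - expected_distance(above))

def objective(w, X, y, lam):
    # Eq. (7): summed margins plus an L1 penalty encouraging sparse weights.
    Xw = X * w
    margins = sum(margin(i, t_k, Xw, y) for i in range(len(y)) for t_k in y)
    return margins + lam * np.sum(w)
```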

In their original description of the method [31], Sun et al. demonstrate performance superior to many other modern machine learning methods, including RGS [24].

When implementing this method, Sun et al. do not suggest an approach for selecting values of the two introduced hyperparameters, the characteristic lengthscale in the kernel function \(l_f\) and the penalty weight \(\lambda\). Hyperparameters are distinguished from ordinary parameters learned from data, such as the weights \(\mathbf {w}\), in that they control the learning process itself and must be determined separately to prevent overfitting. Sun et al. test a range of hyperparameter values in their benchmarking exercise and conclude that the influence of the values on the features selected is relatively minor, recommending that values be fixed a priori. However, we found while benchmarking with the DREAM4 challenge data [7] that the features selected by the algorithm are strongly influenced by the values of the hyperparameters, and that fixing the values a priori can lead to very poor performance. The DREAM4 challenge data represent a larger and more varied set of test cases than Sun et al.’s original benchmarking, including many relationships between predictor and response that are weak relative to noise. An appropriate strategy for selecting hyperparameters thus appears critical to the method’s performance on realistic data for network inference.

A standard approach to selecting hyperparameter values would be nested cross-validation, in which data are held out from the learning process and used to estimate how well the selected features would predict response values on new data, repeated for combinations of hyperparameter values until an optimum is found [6]. We found, however, that feature selection is fairly unstable under this scheme. For example, using ten-fold cross-validation, in which a unique tenth of the dataset is held out and the algorithm selects features on the ten overlapping but unique datasets for a given value of \(l_f\) and \(\lambda\), five folds may select one feature, three folds two features, and two folds five features. Worse still, we observed that the features selected on each fold with 90% of the total data often do not reflect the features selected when the algorithm is run on all of the data. These liabilities substantially reduced the performance of Sun et al.’s feature selection in our tests.

To address these liabilities, we use another approach to estimate the performance of selected features on new data and identify optimum values of hyperparameters. Specifically, we calculate Bayesian Information Criterion (BIC) [30] values for prediction performance by estimating the degrees of freedom of a trained model of selected features [9] and using Platt scaling [19, 26] to condition margin values into probabilities suitable for calculating the log-likelihood of the predictions (details in ‘Appendix’). Additionally, we define a null BIC value to compare performance against. If a model’s BIC is higher than the null value, the features selected are uninformative and are excluded from the inferred network.
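The full procedure appears in the ‘Appendix’; the sketch below conveys only the general shape of the calculation, assuming the classification margins, binary labels, and an estimated degrees of freedom are already in hand, and using a simple gradient-descent fit for the Platt scaling step:

```python
import numpy as np

def platt_scale(margins, labels, n_iter=2000, lr=0.05):
    # Fit p = sigmoid(a * margin + b) by gradient descent on the logistic
    # log-loss, conditioning raw margins into probabilities (labels in {0, 1}).
    a, b = 1.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(a * margins + b)))
        grad = p - labels  # gradient of the log-loss w.r.t. the logit
        a -= lr * np.mean(grad * margins)
        b -= lr * np.mean(grad)
    return a, b

def bic(margins, labels, degrees_of_freedom):
    # BIC = df * ln(N) - 2 * logL, with the log-likelihood computed from
    # the Platt-scaled probabilities.
    a, b = platt_scale(margins, labels)
    p = np.clip(1.0 / (1.0 + np.exp(-(a * margins + b))), 1e-12, 1.0 - 1e-12)
    log_lik = np.sum(labels * np.log(p) + (1 - labels) * np.log(1.0 - p))
    return degrees_of_freedom * np.log(len(labels)) - 2.0 * log_lik
```

A model whose BIC exceeds the null value is then treated as uninformative and its features are excluded from the inferred network.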

A complete description of the methodology is provided in the ‘Appendix’, including additional notes on practical considerations for our implementation of Sun et al.’s methodology, DEKER. Our implementation of DEKER is available at github.com/Merck/deker and can be used as either a stand-alone executable intended for use in a high-performance computing environment, or as a header-only C++ library for development.

Network inference using DEKER-NET

In network inference by iterative feature selection, an iteration of feature selection is performed to identify the predictors of each unique feature in the dataset. The results of a single iteration of feature selection can be visualized as a simple network: the response and each identified predictor are represented by vertices, and edges are directed from each predictor to the response. The combined output of all feature selection iterations is the union of these simple networks: each feature in the dataset is represented by a vertex, and two vertices are connected by an edge whenever feature selection found one feature to predict, or be predicted by, the other. A minimal sketch of this assembly is given below, and a simple example is illustrated in Fig. 1. This network can also be represented as an adjacency matrix, \(\mathbf {A}\), where each element \(A_{i,j}\) indicates whether feature j is a predictor of feature i (\(A_{i,j} > 0\)) or not (\(A_{i,j}=0\)).
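Here the hypothetical `selected` mapping stands in for the per-response feature selection results:

```python
import numpy as np

# Hypothetical feature selection output: for each feature used as the
# response, the indices of the predictors selected for it.
selected = {0: [2, 3], 1: [0], 2: [3], 3: []}

n_features = 4
A = np.zeros((n_features, n_features))
for response, predictors in selected.items():
    for p in predictors:
        A[response, p] = 1.0  # A[i, j] > 0 means feature j predicts feature i
```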

Fig. 1

Illustrating network assembly. On the left, iterations of feature selection for responses a–d are visualized as networks. On the right is the network obtained by combining the relationships identified in the iterations of feature selection. Networks are illustrated without edge directionality for consistency; see ‘Results’ main text regarding directed vs. undirected network inference

Weight-focused network inference methods such as GENIE3 [16] output a network in which all vertices are connected, i.e. \(A_{i,j} > 0\) for all i, j. A sparse network can only be identified from this weighting by excluding edges with weights below an unknown threshold. When the true network structure is known, the performance of these methods is assessed based on the error rates of the range of networks constructed by eliminating edges from the lowest to the highest weight. In particular, the precision, or ratio of true positive edges to total inferred edges, and recall, or ratio of true positive edges to total edges in the true network, are used to capture the trade-off between including the most true edges and excluding the most false edges. For a weight-focused method, ideal performance is realized if there exists a weight threshold which separates all true edges from all false edges. Even in this ideal case, however, there is no method for identifying the optimal threshold in the practical case where the true network structure is unknown.

By contrast, iteratively applying Sun et al.’s feature selection outputs a single sparse network, in which only a minority of the possible edges \(A_{i,j}\) are weighted. In practice, while weight-focused methods are concerned with a ranking of possible networks, a subset-focused method is concerned with identifying a single network structure for analysis and interpretation. There is a trade-off, however: it is still desirable to rank edges within the network inferred by DEKER-NET, but the non-zero weights produced by Sun et al.’s feature selection are not comparable across models. As the network is constructed by combining all models, these weights cannot be used to rank edges.

Fortunately, weight-focused and subset-focused methods are complementary. For DEKER-NET, we use DEKER to select the sparse network while taking the edge weights from GENIE3. Compared to using GENIE3 alone, DEKER-NET explicitly determines which edges to exclude, removing the majority of GENIE3’s low-weighted edges as well as some highly weighted false-positive edges. Additionally, DEKER sometimes includes edges which are ranked very low by GENIE3; these edges are excluded from the final network, further capitalizing on the complementarity between DEKER’s sparse feature selection and GENIE3’s weighting. As a result, DEKER-NET can often realize the best performance of GENIE3 while identifying a sparse, interpretable network. An algorithm detailing the combination of methods and network construction is provided in the ‘Appendix’.
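As a schematic sketch only (the actual rule is part of the algorithm in the ‘Appendix’), the combination can be pictured as follows; the cutoff for a ‘very low’ GENIE3 rank is a hypothetical stand-in:

```python
import numpy as np

def combine(deker_adjacency, genie3_weights, low_rank_fraction=0.02):
    # Keep only the edges selected by DEKER, re-weighted by GENIE3.
    w = np.where(deker_adjacency > 0, genie3_weights, 0.0)
    # Drop DEKER edges that GENIE3 ranks very low; the fraction used here
    # is illustrative, not the rule used by DEKER-NET.
    floor = np.quantile(genie3_weights[genie3_weights > 0], low_rank_fraction)
    w[w < floor] = 0.0
    return w
```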

Results

To illustrate the performance of DEKER-NET, we use a common benchmarking approach on simulated multifactorial data. In multifactorial data, each sample is simulated from a network with multiple unknown perturbations relative to a baseline. These data imitate the available clinical data: each sampled patient has multiple unique, unknown differences in underlying biology relative to the others. Inferring networks from this type of data is much more challenging than inferring them from precisely controlled gene knock-up or knock-down experiments, or from time-series data with multiple observations from the same patient. Nevertheless, this type of data is the easiest to acquire from real patients and best represents the real use-case of this methodology.

Alongside DEKER-NET, we test ‘Correlation’ network inference, which uses the absolute value of the Pearson correlation coefficient to weight edges, and two of the highest-performing network inference methods, TIGRESS [10] and GENIE3 [16]. The Correlation approach is presented to illustrate worst-case performance, as it is the simplest to implement and the most prone to errors. Results for TIGRESS and GENIE3 were generated using the R packages with default values for each [15, 33], with the exception of the K parameter for GENIE3, which was changed to ‘all’ as the authors suggest in [16].

While all methods (except Correlation) are able to infer the directionality of edges, in practice the results are relatively poor when applied to multifactorial data compared to undirected inference [16]. Therefore, in these tests we consider only performance in identifying undirected structure. For example, if the underlying true network has the relationship \(a \rightarrow b\), we accept either the edge \(a \leftarrow b\) or \(a \rightarrow b\) as a true edge, using the higher weight of the two directional edges to define the undirected edge \(a \leftrightarrow b\).

As discussed in ‘Methods’, a common approach to assessing the performance of network inference methods uses precision (ratio of true edges inferred to total edges inferred) and recall (ratio of true edges inferred to total edges in the true network). In particular, each point on a precision-recall curve is generated by iteratively removing the lowest-ranked edge and recalculating the precision and recall values of the resulting network. The most common error we observe during benchmarking is an indirect relationship inferred as a direct relationship. To illustrate such an indirect effect error, take a triplet such as \(a \leftrightarrow b \leftrightarrow c\): if the network inference method infers a relationship \(a \leftrightarrow c\), it is an indirect effect error. These errors have relatively little effect on the structure of the inferred network, however: a and c are closely related in the true network structure, and the erroneous edge has a minor effect on network topology and the resulting interpretation. By contrast, an error connecting two features which are very distant in the true network has a much more serious effect, both on the topology of the inferred network and on the biological interpretation. Equating indirect effect errors with more serious errors therefore gives an overly pessimistic view of the performance of these methods. A more optimistic version of the standard precision-recall curve can be obtained by excluding these indirect effect errors: for a given point, the recall value remains the same while precision may improve. We visualize this by plotting our precision-recall curves as an area bounded by an upper curve corresponding to the precision and recall values calculated omitting indirect effect errors and a lower curve corresponding to precision and recall values including indirect effect errors. A sketch of this evaluation follows.
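In the sketch below, `weights` is a directed weight matrix, `true_adj` is the true adjacency matrix, and an indirect effect error is detected as a false edge whose endpoints share a neighbor in the true network, matching the triplet example above:

```python
import numpy as np

def precision_recall(weights, true_adj):
    w = np.maximum(weights, weights.T)  # undirected: keep the larger direction
    t = np.maximum(true_adj, true_adj.T) > 0
    i_idx, j_idx = np.triu_indices_from(w, k=1)
    order = np.argsort(-w[i_idx, j_idx])  # edges from highest to lowest weight
    ii, jj = i_idx[order], j_idx[order]
    is_true = t[ii, jj]
    # Indirect effect error: a false edge whose endpoints share a true neighbor.
    indirect = ~is_true & np.array([bool(np.any(t[i] & t[j]))
                                    for i, j in zip(ii, jj)])
    kept = np.arange(1, len(order) + 1)  # network size at each cut
    tp = np.cumsum(is_true)
    recall = tp / is_true.sum()
    lower = tp / kept  # precision counting every error
    upper = tp / np.maximum(kept - np.cumsum(indirect), 1)  # omitting indirect errors
    return recall, lower, upper
```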

In the first subsection, we apply DEKER-NET to the ‘Dialogue for Reverse Engineering Assessments and Methods’ initiative’s fourth (DREAM4) challenge on in silico multifactorial data [7]. The DREAM4 challenge data have previously been used to assess the performance of a variety of network inference methods and provide a common reference point for comparison. In the second subsection, we perform the same tests on datasets with increasing numbers of samples simulated from the DREAM4 challenge networks [29], to illustrate the sensitivity of our method to data availability and to set expectations for performance on datasets of different sizes.

Benchmarking with DREAM4 challenge data

In Fig. 2 we show the precision-recall curves for each of the five networks in the DREAM4 in silico multifactorial dataset, with 100 features and 100 samples per network. We plot both the curve for the raw precision values and the curve for precision ignoring indirect effect errors. Note that while the weight-focused Correlation, GENIE3, and TIGRESS methods have curves spanning the full range of recall values, the curve for our subset-focused DEKER-NET method ends abruptly, with precision values at or above approximately 70%. As discussed in ‘Methods’, this is by design, as the goal of DEKER-NET is to identify a sparse network while the weight-focused methods rank all possible relationships. DEKER-NET generally shows a minor improvement in precision and recall over the other methods, including GENIE3.

Fig. 2

Precision-recall curves for DREAM4 in silico multifactorial challenge data show DEKER-NET is generally nearly equal to or better than GENIE3 performance. Precision-recall curves are constructed from the series of networks generated by iteratively removing the lowest-weighted edge of each network and calculating precision (ratio of true inferred edges to total inferred edges) and recall (ratio of true inferred edges to total edges in the true network). Each method is plotted as an area bounded by an upper curve corresponding to the precision and recall values calculated omitting indirect effect errors and a lower curve corresponding to precision and recall values including indirect effect errors; see ‘Results’ main text for details on indirect effect errors. An ideal precision-recall curve would maintain a precision of 1 across all values of recall. One curve (method) is considered superior to another if it has a higher precision across a relevant range of recall values (Color figure online)

In Fig. 3 we show the distribution of edge weights among true edges, indirect error edges, and other error edges. This figure illustrates how effectively each method separates true edges from error edges by edge weight. For DEKER-NET, GENIE3, and TIGRESS, the majority of errors are indirect, and other errors separate fairly well from true or indirect error edges by weight. Overall, DEKER-NET offers a marginal improvement in separating errors from true edges by weight over GENIE3 by eliminating a few incorrect edges that would otherwise be highly weighted by GENIE3 while otherwise retaining the same weights and edge ranking.

Fig. 3

DEKER-NET shows nearly equal or better separation between true and false edges by relative edge weight compared to other methods for DREAM4 in silico multifactorial challenge data. For DEKER-NET, GENIE3, and TIGRESS the majority of false edges are indirect false edges, which have a lower impact on network structure than other false edges. Edge weights are rescaled to [0, 1] to enable comparison between methods with differing intrinsic scales. Distributions of edge weights for each edge type are illustrated as box-and-whiskers plots, with the whiskers corresponding to the minimum (left) and maximum (right), the edges of the box corresponding to the first quartile (left) and third quartile (right), the central line indicating the median value, and extreme values plotted as individual points

To illustrate real-world performance, we plot the network inferred by DEKER-NET alongside the networks inferred by GENIE3 in Fig. 4. For GENIE3 network construction we use a range of thresholds drawn from the literature [8, 14, 27, 34], ranging from the top 0.2% to the top 2% of edges. The 0.2% threshold was omitted from our figures, as this threshold universally resulted in too few edges to be considered. Additional figures for comparison to TIGRESS and Correlation, both of which perform generally worse than GENIE3, are available in Supplemental Figures. To compare networks we use net edge count (true positive edges minus false positive edges) as a heuristic, which we will refer to as ‘net’ performance. This heuristic captures the dual criteria of increasing network size and reducing the number of errors, where adding a true positive has the same weight as removing a false positive. For ranking purposes this heuristic is equivalent to a reweighted mean of precision and recall, such that adding true positives and removing false positives have an equal effect on that mean.

Fig. 4

DEKER-NET performs network inference nearly equal to or better than GENIE3’s best performing edge weight threshold for all DREAM4 in silico multifactorial challenge datasets. GENIE3’s performance depends on the threshold value selected, which is not known in practice and varies substantially even between the similar DREAM4 networks. Threshold values used in this figure are based on commonly used literature values (see text): the top 2%, 1%, 0.8%, and 0.4% of edges. Inferred networks are shaded based on performance relative to the best performing inferred network for each dataset, where the best performing network has the darkest background shading. Networks with less than or equal to 66% of the best performing network’s performance are unshaded. Edge width corresponds to edge weight, with wider edges ranked more highly and assigned higher confidence by inference methods. Edges present in the true network structure but not identified by any inference method are false negatives shown in the true network (top row), while correctly identified edges in the true structure (by at least one method) are true positives shown in both the true network and corresponding inference method network. Edges not present in the true network structure but inferred by a given method are false positives shown in the corresponding inference method network. The total number of true positives (TP), false positives (FP), and net positive edges (TP-FP) are given for each inferred network (Color figure online)

Comparing the networks in Fig. 4 based on net positive edges, DEKER-NET generally performs as well as or better than GENIE3’s best thresholding scenario. DEKER-NET also performs better overall across datasets than any single threshold, as GENIE3’s performance at a given threshold is inconsistent from dataset to dataset. Specifically, DEKER-NET has a higher net performance than GENIE3 for networks 1 and 5 at all thresholds, nearly equal net performance to GENIE3’s best threshold for networks 2 and 4, and equal net performance to GENIE3’s second-best threshold for network 3, as illustrated by the shading in Fig. 4. Additionally, net performance across thresholding scenarios for GENIE3 illustrates the difficulty in selecting a threshold: for network 1 the top 0.4% of edges was best, for networks 2 and 3 the top 1% was best, and for networks 4 and 5 the top 0.8% was best. Considering that there is no way to gauge error rates and select the best threshold when the underlying network is unknown, this variance in performance across thresholds makes inference of novel networks unreliable for GENIE3. By contrast, we have demonstrated that DEKER-NET reliably achieves equivalent or better performance without thresholding, enabling more effective inference of novel networks.

Benchmarking with increasing sample size

To understand how the performance of DEKER-NET depends on the amount of data available, we used the GeneNetWeaver [21, 29] program to simulate additional datasets equivalent to the 100-sample DREAM4 in silico multifactorial datasets tested in the previous section, using the settings provided in the program that were used to generate the original DREAM4 data. The same five networks used in the DREAM4 challenge were used to simulate the dynamics for our test datasets. Using GeneNetWeaver, we generated 400 additional samples per network, and tested the algorithms as in the previous section using draws of 100, 200, and 400 samples from those generated for each network. We do note generally worse performance for all methods on the DREAM4-like data compared to the DREAM4 benchmarks, which may be due to small differences in software version or to run-to-run variance in the simulated data.

Precision-recall curves for inference with increasing sample size, shown in Fig. 5, demonstrate a modest increase in the maximum recall that each method achieves at a given level of precision as sample size increases. The results suggest that, for the DREAM4 and DREAM4-like data, the majority of relationships that can realistically be inferred are found with a relatively modest amount of data. This is likely due to the inherent difficulty of network inference on multifactorial data. We also note that the precision of the full subset of edges identified by DEKER-NET is generally consistent at \(\ge 60\%\), worse than observed on the DREAM4 data, though non-indirect errors occurred at a very low rate. Figure 6 further illustrates that each doubling of sample size has a less than proportional effect on the network subset that DEKER-NET infers; four times the data generally yields less than twice the number of true edges, with comparable precision.

Fig. 5

Precision-recall values improve as the number of samples increases for DREAM4-like datasets. Precision-recall curves are constructed from the series of networks generated by iteratively removing the lowest-weighted edge of each network and calculating precision (ratio of true inferred edges to total inferred edges) and recall (ratio of true inferred edges to total edges in the true network). Each method is plotted as an area, with the upper curve corresponding to the precision and recall values calculated omitting indirect effect errors and the lower curve corresponding to precision and recall values including indirect effect errors. See ‘Results’ main text for details on indirect effect errors (Color figure online)

Fig. 6

DEKER-NET identifies larger networks with generally increasing performance as the number of samples increases for DREAM4-like datasets. Inferred networks are shaded based on performance relative to the best performing inferred network for each dataset, where the best performing network has the darkest shading. Networks with less than or equal to 66% of the best performing network’s performance are unshaded. Edge width corresponds to edge weight, with wider edges ranked more highly and assigned higher confidence by inference methods. Edges present in the true network structure but not identified by any inference method are false negatives shown in the true network (top row), while correctly identified edges in the true structure are true positives shown in both the true network and corresponding inference method network. Edges not present in the true network structure but inferred by a given method are false positives shown in the corresponding inference method network. The total number of true positives (TP), false positives (FP), and net positive edges (TP-FP) are given for each inference method network (Color figure online)

Overall, the performance of DEKER-NET as sample size increases suggests that, for data types and biological networks similar to the gene regulatory network data that GeneNetWeaver emulates, there are diminishing returns to increasing sample size, and that 100–200 samples may be an ideal target depending on the cost of generating data. Sampling from different populations may be more impactful than increasing sample size within a given population; however, this was not possible to test in the given benchmarking framework. Additionally, it is unknown how the size and complexity of the true network to be inferred may affect the number of samples needed for inference.

Discussion

Prior to DEKER-NET, network inference methods were primarily limited by their focus on weighting and ranking all possible relationships, which produces a dense, uninterpretable network structure. Interpretation of these dense networks requires heuristic tuning of a weight threshold to produce a sparse network, a process which is error-prone and not well defined in the literature. DEKER-NET addresses this limitation by identifying a sparse, interpretable network directly. Furthermore, DEKER-NET incorporates the best-performing weight-focused method, GENIE3 [16], enabling both selection of a sparse network and effective ranking of edges within that network. Through the synergy between the two feature selection approaches, DEKER-NET maintains the lowest error rates while selecting an interpretable sparse network, addressing an important limitation of earlier methods.

The tests in ‘Results’ focus on realistic scenarios in terms of data quality and quantity, setting expectations for the real-world performance of DEKER-NET. These tests use multifactorial data, which best imitates the type of data received from patients with many unknown sources of variation in biology, simulated using a framework intended to imitate a range of realistic network structures and levels of noise [29]. As a result, the networks inferred are incomplete, typically capturing only approximately 20% of the true structure of the biological network used to simulate the data. Regardless, the information gained about this portion of the network structure is useful: for networks inferred by DEKER-NET in our tests, roughly 7 of every 10 identified relationships are true, mechanistic biological relationships. Additionally, the incorrectly inferred relationships are primarily indirect effect errors, which link closely related features that share a common relationship. These indirect effect errors have minimal impact on the overall structure of the network in comparison with more serious errors linking features more distant in the true network topology.

Given the challenges presented by multifactorial data, we believe DEKER-NET is most useful as a hypothesis generating tool. The identified network structure is a reliable, albeit incomplete, description of biological pathways present in the dataset. These pathways can also be considered a focused set of mechanistic hypotheses best supported by the available data, and can direct experimental follow-up to the most relevant biology.

Another important consideration for the application of network inference methods is the range of types of biological data (’omes) to be analyzed, and particularly the fusion of data from distinct domains and technologies into multi-omic datasets. The DREAM4 benchmarking data are intended to imitate gene expression or transcriptomic data, and further work is needed to understand how methods perform on different data types or combinations thereof. In principle, the machine learning methods upon which DEKER-NET is built are robust to different data types: they do not rely on any assumptions about the underlying distribution of the data or the relationships between features, and they work with both continuous and categorical data. Similarly, biologically plausible results have been shown from applying GENIE3 to real multi-omic data [11, 12]. It remains important to further test performance across data types.

The applications of network inference are broad. In general, the DEKER-NET network inference methodology is a tool for gaining insight into biology. It can be used to identify targets or biomarkers when applied to data with specific endpoints of interest, or to understand translation when applied to data from multiple different species. Most importantly, it supports these critical questions with clear, precise hypotheses and anchors efforts in drug discovery and development to biology in a repeatable and objective way.