Background

Two types of reconstruction methods for directed networks have been developed and applied to a variety of experimental datasets. These methods are either based on Bayesian scores [1], [2] or rely on the identification of structural independencies, which correspond to missing edges in the underlying network [3], [4].

Bayesian inference approaches have the advantage of allowing for quantitative comparisons between alternative networks through their Bayesian scores, but they are limited to rather small causal graphs due to the super-exponentially growing space of possible directed graphs to sample [1], [5], [6]. Hence, Bayesian inference methods typically require either suitable prior restrictions on the structures [7], [8] or heuristic search strategies such as hill-climbing algorithms [9]–[11].

By contrast, structure learning algorithms based on the identification of structural constraints typically run in polynomial time on sparse underlying graphs. These so-called constraint-based approaches, such as the PC [12] and IC [13] algorithms, do not score and compare alternative networks. Instead, they aim at ascertaining conditional independencies between variables to directly infer the Markov equivalence class of all causal graphs compatible with the available observational data. Yet, these methods are not robust to sampling noise in finite datasets, as early errors in removing edges from the complete graph typically trigger the accumulation of compensatory errors later on in the pruning process. This cascading effect makes constraint-based approaches sensitive to the adjustable significance level α required for the conditional independence tests. In addition, traditional constraint-based methods are not robust to the order in which the conditional independence tests are processed, which prompted recent algorithmic improvements intending to achieve order-independence [14].

In this paper, we report a novel network reconstruction method, which exploits the best of these two types of structure learning approaches. It combines constraint-based and Bayesian frameworks to reliably reconstruct graphical models despite inherent sampling noise in finite observational datasets. To this end, we have developed a robust information-theoretic method to confidently ascertain structural independencies in causal graphs based on the ranking of their most likely contributing nodes. Conditional independencies are derived using an iterative search approach that identifies the most significant indirect contributions to all pairwise mutual information between variables. This local optimization algorithm, outlined below, amounts to iteratively subtracting the most likely conditional 3-point information from 2-point information between each pair of nodes. The resulting network skeleton is then partially directed by orienting and propagating edge directions, based on the sign and magnitude of the conditional 3-point information of unshielded triples. Identifying structural independencies within such a maximum likelihood framework circumvents the need for adjustable significance levels and is found to be more robust to sampling noise from finite observational data, even when compared to constraint-based methods intending to resolve the order-dependence on the variables [14].

Constraint-based methods

Constraint-based approaches, such as the PC [12] and IC [13] algorithms, infer causal graphs from observational data by searching for conditional independencies among variables. Under the Markov and Faithfulness assumptions, these algorithms return a Complete Partially Directed Acyclic Graph (CPDAG) that represents the Markov equivalence class of the underlying causal structure [3], [4]. They proceed in three steps, detailed in Algorithm 1 and sketched below:

  • 1) inferring unnecessary edges and associated separation sets to obtain an undirected skeleton;

  • 2) orienting unshielded triples as v-structures if their middle node is not in the separation set (rule R0);

  • 3) propagating as many orientations as possible following the propagation rules (R1–3), which prevent the orientation of additional v-structures (R1) and directed cycles (R2–3) [15].
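To make step 1 concrete, here is a minimal Python sketch of a PC-style skeleton search; it is an illustrative rendering, not the cited implementations, and `cond_indep(x, y, S)` is a hypothetical placeholder for any conditional independence test at significance level α:

```python
from itertools import combinations

def pc_skeleton(nodes, cond_indep):
    """Step 1 of a PC-style algorithm: prune the complete graph.

    cond_indep(x, y, S) is assumed to return True when x and y test as
    independent given the conditioning set S (at some level alpha).
    Returns the undirected skeleton (adjacency sets) and the separation
    sets reused in steps 2-3 for orientation (R0) and propagation (R1-3).
    """
    adj = {x: set(nodes) - {x} for x in nodes}       # start from the complete graph
    sepset = {}
    level = 0                                        # size of the conditioning sets
    while any(len(adj[x]) - 1 >= level for x in nodes):
        for x, y in combinations(nodes, 2):
            if y not in adj[x]:
                continue
            # test all conditioning sets of the current size drawn from the
            # neighbours of x (minus y); the order matters on finite data
            for S in combinations(sorted(adj[x] - {y}), level):
                if cond_indep(x, y, set(S)):
                    adj[x].discard(y)
                    adj[y].discard(x)
                    sepset[frozenset((x, y))] = set(S)
                    break
        level += 1
    return adj, sepset
```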

However, as previously stated, the sensitivity of the constraint-based methods to the adjustable significance level α used for the conditional independence tests and to the order in which the variables are processed (step 1) favors the accumulation of errors when the search procedure relies on finite observational data.

In this paper, we aim at improving constraint-based methods (Algorithm 1) by uncovering the most reliable conditional independencies supported by the (finite) available data, based on a quantitative information theoretic framework.

Maximum likelihood methods

The maximum likelihood $\mathcal{L}_{\mathcal{G}}$ is related to the cross entropy $H(\mathcal{G},\mathcal{D}) = -\sum_{\{x_i\}} p(\{x_i\})\log q(\{x_i\})$ between the "true" probability distribution $p(\{x_i\})$ from the data $\mathcal{D}$ and the approximate probability distribution $q(\{x_i\}) = \prod_i p(x_i|\{\mathrm{Pa}_{x_i}\})$ generated by the Bayesian network $\mathcal{G}$ with specific parent nodes $\{\mathrm{Pa}_{x_i}\}$ for each node $x_i$, leading to [16],

$$\mathcal{L}_{\mathcal{G}} = e^{-NH(\mathcal{G},\mathcal{D})} = e^{-N\sum_i H(x_i|\{\mathrm{Pa}_{x_i}\})} \tag{1}$$

where $\sum_i H(x_i|\{\mathrm{Pa}_{x_i}\})$ is the (conditional) entropy of the underlying causal graph. This makes it possible to score and compare alternative models through their maximum likelihood ratio as,

$$\frac{\mathcal{L}_{\mathcal{G}}}{\mathcal{L}_{\mathcal{G}'}} = e^{N\sum_i\left[H(x_i|\{\mathrm{Pa}'_{x_i}\}) - H(x_i|\{\mathrm{Pa}_{x_i}\})\right]} \tag{2}$$

Note, in particular, that the significance level of the maximum likelihood approach is set by the number N of independent observational data points, as detailed in the Methods section below.
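As an illustration, the decomposable score in Eq. 1 can be estimated from empirical counts; the sketch below (with a hypothetical `data` argument mapping variable names to columns of discrete values) returns $\log\mathcal{L}_{\mathcal{G}} = -N\sum_i H(x_i|\{\mathrm{Pa}_{x_i}\})$, so that the ratio of Eq. 2 reduces to a difference of two such scores:

```python
import math
from collections import Counter

def cond_entropy(data, x, parents):
    """Empirical conditional entropy H(x | parents) in nats.

    data maps variable names to equal-length lists of discrete values.
    """
    n = len(data[x])
    joint = Counter(zip(*(data[v] for v in [x] + parents)))
    marg = Counter(zip(*(data[v] for v in parents))) if parents else {(): n}
    # H(x|Pa) = - sum_{x,pa} p(x,pa) log p(x|pa), with p(x|pa) = c / marg
    return -sum(c / n * math.log(c / marg[key[1:]])
                for key, c in joint.items())

def log_likelihood(data, dag):
    """log L_G = -N * sum_i H(x_i | Pa_{x_i})   (Eq. 1).

    dag maps each node to the list of its parent nodes.
    """
    n = len(next(iter(data.values())))
    return -n * sum(cond_entropy(data, x, pa) for x, pa in dag.items())

# Eq. 2 then reduces to a difference of scores:
# log(L_G / L_G') = log_likelihood(data, G) - log_likelihood(data, G')
```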

Methods

Information theoretic framework

Inferring isolated v-structures vs non-v-structures from 3-point and 2-point information

Applying the previous likelihood definition, Eq. 1, to isolated v-structures (Fig. 1a) and Markov equivalent non-v-structures (Fig. 1b–d), one obtains,

$$\mathcal{L}_{v(xy)} = e^{-N\left[H(z|x,y)+H(x)+H(y)\right]} = e^{-N\left[H(x,y,z)+I(x;y)\right]} \tag{3}$$
Fig. 1 Inference of v-structures versus non-v-structures by 3-point information from observational data. a Isolated v-structures are predicted for $I(x;y;z)<0$, and (b–d) isolated non-v-structures for $I(x;y;z)>0$. e Generalized v-structures are predicted for $I(x;y;z|\{u_i\})<0$ and (f–h) generalized non-v-structures for $I(x;y;z|\{u_i\})>0$. In addition, as $I(x;y;z|\{u_i\})$ is invariant upon x, y, z permutations, the global orientation of v-structures and non-v-structures also requires finding the most likely base of the xyz triple. Choosing the base xy with the lowest conditional mutual information, i.e., $I(x;y|\{u_i\}) = \min_{xyz} I(s;t|\{u_i\})$, is found to be consistent with the Data Processing Inequality expected for (generalized) non-v-structures in the limit of an infinite dataset, see main text. In practice, given a finite dataset, the inference of (generalized) v-structures versus non-v-structures can be obtained by replacing the 2-point and 3-point information terms, $I(x;y|\{u_i\})$ and $I(x;y;z|\{u_i\})$, by their shifted equivalents, $I'(x;y|\{u_i\})$ and $I'(x;y;z|\{u_i\})$, including finite size corrections, see text (Eqs. 23 & 24)

where I(x;y)=H(x)+H(y)−H(x,y) is the 2-point mutual information between x and y, and,

$$\mathcal{L}_{nv(xy)} = e^{-N\left[H(x|z)+H(y|z)+H(z)\right]} = e^{-N\left[H(x,y,z)+I(x;y|z)\right]} \tag{4}$$

where I(x;y|z)=H(x|z)+H(y|z)−H(x,y|z) is the conditional mutual information between x and y given z. Hence, one obtains the likelihood ratio,

$$\frac{\mathcal{L}_{v(xy)}}{\mathcal{L}_{nv(xy)}} = e^{-N\left[I(x;y)-I(x;y|z)\right]} = e^{-NI(x;y;z)} \tag{5}$$

where we introduced the 3-point information function, I(x;y;z)=I(x;y)−I(x;y|z), which is in fact invariant upon permutations between x,y and z, as seen in terms of entropy functions,

$$I(x;y;z) = H(x)+H(y)+H(z)-H(x,y)-H(x,z)-H(y,z)+H(x,y,z) \tag{6}$$

As long recognized in the field [17], [18], the 3-point information, $I(x;y;z)$, can be positive or negative (if $I(x;y)<I(x;y|z)$), unlike 2-point mutual information, which is always non-negative, $I(x;y)\geq 0$.

More precisely, Eq. 5 demonstrates that the sign and magnitude of 3-point information provide a quantitative estimate of the relative likelihoods of isolated v-structures versus non-v-structures, which are in fact independent of their actual non-connected bases xy, xz or yz,

$$\frac{\mathcal{L}_{v(xy)}}{\mathcal{L}_{nv(xy)}} = \frac{\mathcal{L}_{v(xz)}}{\mathcal{L}_{nv(xz)}} = \frac{\mathcal{L}_{v(yz)}}{\mathcal{L}_{nv(yz)}} = e^{-NI(x;y;z)} \tag{7}$$

Hence, a significantly negative 3-point information, I(x;y;z)<0, implies that a v-structure is more likely than a non-v-structure given the observed correlation data. Conversely, a significantly positive 3-point information, I(x;y;z)>0, implies that a non-v-structure model is more likely than a v-structure model.
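As a minimal numerical sketch of this criterion (assuming discrete data held in plain Python lists), the classic XOR triple gives a clearly negative 3-point information, flagging a v-structure:

```python
import math
import random
from collections import Counter

def entropy(*cols):
    """Empirical joint entropy (in nats) of one or more discrete columns."""
    n = len(cols[0])
    return -sum(c / n * math.log(c / n) for c in Counter(zip(*cols)).values())

def three_point_info(x, y, z):
    """I(x;y;z) = H(x)+H(y)+H(z) - H(x,y)-H(x,z)-H(y,z) + H(x,y,z)  (Eq. 6)."""
    return (entropy(x) + entropy(y) + entropy(z)
            - entropy(x, y) - entropy(x, z) - entropy(y, z)
            + entropy(x, y, z))

# Toy example: z = x XOR y makes x and y marginally independent but
# dependent given z, the signature of a v-structure (I(x;y;z) < 0).
random.seed(0)
xs = [random.randint(0, 1) for _ in range(5000)]
ys = [random.randint(0, 1) for _ in range(5000)]
zs = [a ^ b for a, b in zip(xs, ys)]
print(three_point_info(xs, ys, zs))   # ~ -log(2), i.e. significantly negative
```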

Yet, as noted above, since the 3-point information, $I(x;y;z)$, is symmetric by construction, it cannot indicate how to orient v-structures or non-v-structures over the xyz triple. To this end, it is however straightforward to show that the most likely base (xy, xz or yz) of the local v-structure or non-v-structure corresponds to the pair with the lowest mutual information, e.g., $I(x;y) = \min_{xyz} I(s;t)$, as shown by the likelihood ratios,

$$\frac{\mathcal{L}_{v(xy)}}{\mathcal{L}_{v(st)}} = \frac{\mathcal{L}_{nv(xy)}}{\mathcal{L}_{nv(st)}} = \frac{e^{-NI(x;y)}}{e^{-NI(s;t)}} \tag{8}$$

Note, in particular, that choosing the base with the lowest mutual information is consistent with the Data Processing Inequality expected for non-v-structures, Fig. 1b–d.

Hence, combining 3-point and 2-point information makes it possible to determine the likelihood and the base of isolated v-structures versus non-v-structures. But how can such simple results be extended to identify local v-structures and non-v-structures embedded within an entire graph $\mathcal{G}$?

Inferring embedded v-structures vs non-v-structures from conditional 3-point and 2-point information

To go from isolated to embedded v-structures and non-v-structures within a DAG $\mathcal{G}$, we will consider the Markov equivalent CPDAG of $\mathcal{G}$ and introduce generalized v-structures and non-v-structures, Fig. 1e–h. We will demonstrate that their relative likelihood, given the available observational data, can be estimated from the sign and magnitude of a conditional 3-point information, $I(x;y;z|\{u_i\})$, Eq. 11. This will extend our initial result valid for isolated v-structures and non-v-structures, Eq. 7.

Let's consider a pair of non-neighbor nodes x, y with a set of upstream nodes $\{u_i\}_n$, where each node $u_i$ has at least one directed link to x ($u_i \to x$) or y ($u_i \to y$) or to another upstream node $u_j \in \{u_i\}_n$ ($u_i \to u_j$), or only undirected links to these nodes ($u_i - x$, $u_i - y$ or $u_i - u_j$). Thus, given x, y and a set of upstream nodes $\{u_i\}_n$, any additional node z can either be:

  • i) at the apex of a generalized v-structure, if all existing connections between x, y, $\{u_i\}_n$ and z are directed and point towards z, Fig. 1e, or else,

  • ii) z has at least one undirected link with x, y or one of the upstream nodes $u_i$ ($z - x$, $z - y$ or $z - u_i$) or at least one directed link pointing towards these nodes ($z \to x$, $z \to y$ or $z \to u_i$), Fig. 1f–h. In such a case, z might contribute to the mutual information I(x;y) and should be included in the set of upstream nodes $\{u_i\}_n$, thereby defining a generalized non-v-structure, Fig. 1f–h.

Then, similarly to the case of an isolated v-structure (Eq. 3), the maximum likelihood $\mathcal{L}_{v(xy)}$ of a generalized v-structure pointing towards z from a base xy with upstream nodes $\{u_i\}_n$ can be expressed as,

$$\mathcal{L}_{v(xy)} = e^{-N\left[H(z|x,y,\{u_i\})+H(x|\{u_i\})+H(y|\{u_i\})+H(\{u_i\})\right]} = e^{-N\left[H(x,y,z,\{u_i\})+I(x;y|\{u_i\})\right]} \tag{9}$$

where $I(x;y|\{u_i\})$ is the conditional mutual information between x and y given $\{u_i\}$, $I(x;y|\{u_i\}) = H(x|\{u_i\}) + H(y|\{u_i\}) - H(x,y|\{u_i\})$.

Likewise, the maximum likelihood $\mathcal{L}_{nv(xy)}$ of a generalized non-v-structure of base xy with upstream nodes $\{u_i\}_n$ and z can be expressed as,

$$\mathcal{L}_{nv(xy)} = e^{-N\left[H(x|z,\{u_i\})+H(y|z,\{u_i\})+H(z,\{u_i\})\right]} = e^{-N\left[H(x,y,z,\{u_i\})+I(x;y|z,\{u_i\})\right]} \tag{10}$$

where $I(x;y|z,\{u_i\}) = H(x|z,\{u_i\}) + H(y|z,\{u_i\}) - H(x,y|z,\{u_i\})$ is the conditional mutual information between x and y given z and $\{u_i\}$. Hence,

$$\frac{\mathcal{L}_{v(xy)}}{\mathcal{L}_{nv(xy)}} = e^{-NI(x;y;z|\{u_i\})} \tag{11}$$

where we introduced the conditional 3-point information, $I(x;y;z|\{u_i\}) = I(x;y|\{u_i\}) - I(x;y|z,\{u_i\})$.

Hence, a significantly negative conditional 3-point information, $I(x;y;z|\{u_i\})<0$, implies that a generalized v-structure is more likely than a generalized non-v-structure given the available observational data. Conversely, a significantly positive conditional 3-point information, $I(x;y;z|\{u_i\})>0$, implies that a generalized non-v-structure model is more likely than a generalized v-structure model.

Yet, as the conditional 3-point information, $I(x;y;z|\{u_i\})$, is in fact invariant upon permutations between x, y and z, it cannot indicate how to orient embedded v-structures or non-v-structures over the xyz triple, as already noted above in the case of isolated v-structures and non-v-structures.

However, the most likely base (xy, xz or yz) of the embedded v-structure or non-v-structure corresponds to the least correlated pair conditioned on $\{u_i\}$, e.g., $I(x;y|\{u_i\}) = \min_{xyz} I(s;t|\{u_i\})$, as shown with the following likelihood ratios,

$$\frac{\mathcal{L}_{v(xy)}}{\mathcal{L}_{v(st)}} = \frac{\mathcal{L}_{nv(xy)}}{\mathcal{L}_{nv(st)}} = \frac{e^{-NI(x;y|\{u_i\})}}{e^{-NI(s;t|\{u_i\})}} \tag{12}$$

Note, in particular, that choosing the base with the lowest conditional mutual information, e.g., $I(x;y|\{u_i\}) = \min_{xyz} I(s;t|\{u_i\})$, is consistent with the Data Processing Inequality expected for the generalized non-v-structures of Fig. 1f–h, $I(x;y) \leq \min\left(I(x;z,\{u_i\}),\, I(z,\{u_i\};y)\right)$, as shown below for I(x;y) and $I(x;z,\{u_i\})$, by subtracting $I(x;y;z|\{u_i\})$ on each side of the inequality $I(x;y|\{u_i\}) \leq I(x;z|\{u_i\})$, leading to,

$$I(x;y|z,\{u_i\}) \;\leq\; I(x;z|\{u_i\},y) \;\leq\; I(x;z|\{u_i\},y) + I(x;\{u_i\}|y) = I(x;z,\{u_i\}|y) \;\;\Longrightarrow\;\; I(x;y) \leq I(x;z,\{u_i\}) \tag{13}$$

where we have used the chain rule, $I(x;z,\{u_i\}|y) = I(x;z|\{u_i\},y) + I(x;\{u_i\}|y)$, before adding $I(x;y;z,\{u_i\})$ on each side of the inequality. The corresponding inequality holds between I(x;y) and $I(z,\{u_i\};y)$, implying the Data Processing Inequality.

Finite size corrections of maximum likelihood

Maximum likelihood ratios, such as Eq. 2, suggest that 1/N sets the significance level of the maximum likelihood approach, as a difference $H(\mathcal{G}',\mathcal{D}) - H(\mathcal{G},\mathcal{D}) \gtrsim 1/N$ should imply a significant improvement of the underlying model $\mathcal{G}$ over $\mathcal{G}'$. In practice, however, there are $O(\log(N)/N)$ corrections coming from the proper normalization of maximum likelihoods (see Appendix),

$$\mathcal{L}_{\mathcal{G}} = \frac{e^{-N\sum_i H(x_i|\{\mathrm{Pa}_{x_i}\})}}{Z(\mathcal{G},\mathcal{D})} \tag{14}$$

The model $\mathcal{G}$ can then be compared to the alternative model $\mathcal{G}_{\setminus xy}$ with one missing edge $x\to y$ using the maximum likelihood ratio,

$$\frac{\mathcal{L}_{\mathcal{G}_{\setminus xy}}}{\mathcal{L}_{\mathcal{G}}} = e^{-NI(x;y|\{\mathrm{Pa}_y\}_{\setminus x})}\,\frac{Z(\mathcal{G},\mathcal{D})}{Z(\mathcal{G}_{\setminus xy},\mathcal{D})} \tag{15}$$

where $I(x;y|\{\mathrm{Pa}_y\}_{\setminus x}) = H(y|\{\mathrm{Pa}_y\}_{\setminus x}) - H(y|\{\mathrm{Pa}_y\})$.

Then, following the rationale of constraint-based approaches, Eq. 15 can be reformulated by replacing the parent nodes $\{\mathrm{Pa}_y\}_{\setminus x}$ with an unknown separation set $\{u_i\}$ to be learnt simultaneously with the missing edge candidate $x\to y$,

$$\frac{\mathcal{L}_{\mathcal{G}_{\setminus xy|\{u_i\}}}}{\mathcal{L}_{\mathcal{G}}} = e^{-NI(x;y|\{u_i\}) + k_{x;y|\{u_i\}}} \tag{16}$$
$$k_{x;y|\{u_i\}} = \log\left[Z(\mathcal{G},\mathcal{D})/Z(\mathcal{G}_{\setminus xy|\{u_i\}},\mathcal{D})\right] \tag{17}$$

where the factor $k_{x;y|\{u_i\}}>0$ tends to limit the complexity of the models by favoring fewer edges. Namely, the condition $I(x;y|\{u_i\}) < k_{x;y|\{u_i\}}/N$ implies that simpler models compatible with the structural independency, $x \perp\!\!\!\perp y\,|\,\{u_i\}$, are more likely than model $\mathcal{G}$, given the finite available dataset. This replaces the 'perfect' conditional independency condition, $I(x;y|\{u_i\})=0$, valid in the limit of an infinite dataset, $N\to\infty$. A common complexity criterion in model selection is the Bayesian Information Criterion (BIC) or Minimal Description Length (MDL) criterion [19], [20],

$$k^{\mathrm{MDL}}_{x;y|\{u_i\}} = \frac{1}{2}(r_x-1)(r_y-1)\prod_i r_{u_i}\,\log N \tag{18}$$

where $r_x$, $r_y$ and $r_{u_i}$ are the numbers of levels of the corresponding variables. The MDL complexity, Eq. 18, is simply related to the normalisation constant of the distribution reached in the asymptotic limit of a large dataset, $N\to\infty$ (Laplace approximation). In practice, however, this limit distribution is only reached for very large datasets.
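For concreteness, here is a small sketch of the MDL complexity of Eq. 18 and of the resulting finite-size decision rule (anticipating the 3-point complexity and shifted information of Eqs. 20-21 below); the variable names are illustrative:

```python
import math

def k_mdl(r_x, r_y, r_us, n):
    """MDL/BIC complexity of Eq. 18: (r_x-1)(r_y-1) prod_i r_ui log(N) / 2."""
    prod = 1
    for r in r_us:
        prod *= r
    return 0.5 * (r_x - 1) * (r_y - 1) * prod * math.log(n)

def k_mdl_3pt(r_x, r_y, r_us, r_z, n):
    """3-point complexity of Eq. 20: k_{x;y|{u_i},z} - k_{x;y|{u_i}} > 0."""
    return k_mdl(r_x, r_y, r_us + [r_z], n) - k_mdl(r_x, r_y, r_us, n)

def shifted_i2(i_xy_ui, r_x, r_y, r_us, n):
    """I'(x;y|{u_i}) = I(x;y|{u_i}) - k/N (Eq. 21): a negative value
    supports the structural independency x _||_ y | {u_i}."""
    return i_xy_ui - k_mdl(r_x, r_y, r_us, n) / n

# Two binary variables and one binary conditioning node, N = 1000: the
# edge x-y survives only if I(x;y|u) exceeds k/N = 0.5*1*1*2*log(1000)/1000
print(k_mdl(2, 2, [2], 1000) / 1000)   # ~ 0.0069 nats
```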

Alternatively, the normalisation of the maximum likelihood can also be done over all possible datasets including the same number of data points, to yield a (universal) Normalized Maximum Likelihood (NML) criterion [21], [22] and its decomposable [23], [24] and xy-symmetric version, $k^{\mathrm{NML}}_{x;y|\{u_i\}}$, defined in the Appendix.

Then, incrementing the separation set of the pair xy from $\{u_i\}$ to $\{u_i\}+z$ leads to the following likelihood ratio,

$$\frac{\mathcal{L}_{\mathcal{G}_{\setminus xy|\{u_i\},z}}}{\mathcal{L}_{\mathcal{G}_{\setminus xy|\{u_i\}}}} = e^{NI(x;y;z|\{u_i\}) + k_{x;y;z|\{u_i\}}} \tag{19}$$

with $I(x;y;z|\{u_i\}) = I(x;y|\{u_i\}) - I(x;y|\{u_i\},z)$ and where we introduced a 3-point conditional complexity, $k_{x;y;z|\{u_i\}}$, defined similarly as the difference between the 2-point conditional complexities,

$$k_{x;y;z|\{u_i\}} = k_{x;y|\{u_i\},z} - k_{x;y|\{u_i\}} \tag{20}$$

However, unlike the 3-point information, $I(x;y;z|\{u_i\})$, 3-point complexities are always positive, $k_{x;y;z|\{u_i\}}>0$, provided that there are at least two levels for each implicated node x, y, z, $\{u_i\}$, i.e. $r\geq 2$.

Hence, we can define the shifted 2-point and 3-point information in Eqs. 16 & 19 for finite datasets as,

$$I'(x;y|\{u_i\}) = I(x;y|\{u_i\}) - \frac{k_{x;y|\{u_i\}}}{N} \tag{21}$$
$$I'(x;y;z|\{u_i\}) = I(x;y;z|\{u_i\}) + \frac{k_{x;y;z|\{u_i\}}}{N} \tag{22}$$

This leads to the following maximum likelihood ratios equivalent to Eqs. 11 & 12 for v-structure over non-v-structure and between alternative bases,

$$\frac{\mathcal{L}_{v(xy)}}{\mathcal{L}_{nv(xy)}} = e^{-NI'(x;y;z|\{u_i\})} \tag{23}$$
$$\frac{\mathcal{L}_{v(xy)}}{\mathcal{L}_{v(st)}} = \frac{\mathcal{L}_{nv(xy)}}{\mathcal{L}_{nv(st)}} = \frac{e^{-NI'(x;y|\{u_i\})}}{e^{-NI'(s;t|\{u_i\})}} \tag{24}$$

Hence, given a finite dataset, a significantly negative conditional 3-point information, corresponding to $I'(x;y;z|\{u_i\})<0$, implies that a v-structure $x\to z\leftarrow y$ is more likely than a non-v-structure, provided that the structural independency, $x \perp\!\!\!\perp y\,|\,\{u_i\}$, is also confidently established as $I'(x;y|\{u_i\})<0$. By contrast, a significantly positive conditional 3-point information corresponds to $I'(x;y;z|\{u_i\})>0$ and implies that a non-v-structure model is more likely than a v-structure model, given the available observational data.

Probability estimate of indirect contributions to mutual information

The previous results enable us to estimate the probability of a node z to contribute to the conditional mutual information $I(x;y|\{u_i\})$, by combining the probability, $P_{nv}(xyz|\{u_i\})$, that the triple xyz is a generalized non-v-structure conditioned on $\{u_i\}$ and the probability, $P_b(xy|\{u_i\})$, that its base is xy, where,

$$P_{nv}(xyz|\{u_i\}) = \frac{\mathcal{L}_{nv(xy)}}{\mathcal{L}_{nv(xy)} + \mathcal{L}_{v(xy)}} \tag{25}$$
$$P_b(xy|\{u_i\}) = \frac{\mathcal{L}_{nv(xy)}}{\mathcal{L}_{nv(xy)} + \mathcal{L}_{nv(xz)} + \mathcal{L}_{nv(yz)}} \tag{26}$$

that is, using Eqs. 23 & 24 including finite size corrections of the maximum likelihoods,

$$P_{nv}(xyz|\{u_i\}) = \frac{1}{1 + e^{-NI'(x;y;z|\{u_i\})}} \tag{27}$$
$$P_b(xy|\{u_i\}) = \left[1 + \frac{e^{-NI'(x;z|\{u_i\})}}{e^{-NI'(x;y|\{u_i\})}} + \frac{e^{-NI'(y;z|\{u_i\})}}{e^{-NI'(x;y|\{u_i\})}}\right]^{-1} \tag{28}$$

Then, various alternatives exist to combine $P_{nv}(xyz|\{u_i\})$ and $P_b(xy|\{u_i\})$ into an overall probability that the additional node z indirectly contributes to $I(x;y|\{u_i\})$. One possibility is to choose the lower bound $S_{lb}(z;xy|\{u_i\})$ of $P_{nv}(xyz|\{u_i\})$ and $P_b(xy|\{u_i\})$, since both conditions need to be fulfilled to warrant that z indeed contributes to $I(x;y|\{u_i\})$,

$$S_{lb}(z;xy|\{u_i\}) = \min\left(P_{nv}(xyz|\{u_i\}),\, P_b(xy|\{u_i\})\right) \tag{29}$$

The pairs of nodes xy with the most likely contribution from a third node z can then be ordered according to their rank $R(xy;z|\{u_i\})$ defined as,

$$R(xy;z|\{u_i\}) = \max_z S_{lb}(z;xy|\{u_i\}) \tag{30}$$

and z can be iteratively added to the set of contributing nodes (i.e. $\{u_i\} \leftarrow \{u_i\}+z$) of the top link $xy = \mathrm{argmax}_{xy}\, R(xy;z|\{u_i\})$ to progressively recover the most significant indirect contributions to all pairwise mutual information in a causal graph, as outlined below.
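Under the same finite-size corrections, Eqs. 27-30 can be assembled as follows (a sketch; the shifted information values $I'$ are assumed to be precomputed and passed in as plain floats, and the exponents should be clamped in production use):

```python
import math

def p_nv(n, i3p):
    """P_nv(xyz|{u_i}) = 1 / (1 + e^{-N I'(x;y;z|{u_i})})   (Eq. 27)."""
    return 1.0 / (1.0 + math.exp(-n * i3p))

def p_base(n, i_xy, i_xz, i_yz):
    """P_b(xy|{u_i}) of Eq. 28, from shifted conditional 2-point terms:
    the base xy wins when I'(x;y|{u_i}) is the lowest of the three."""
    return 1.0 / (1.0 + math.exp(-n * (i_xz - i_xy))
                      + math.exp(-n * (i_yz - i_xy)))

def s_lb(n, i3p, i_xy, i_xz, i_yz):
    """Lower-bound score of Eq. 29: both the non-v-structure condition
    and the base condition must hold for z to contribute to I(x;y|{u_i})."""
    return min(p_nv(n, i3p), p_base(n, i_xy, i_xz, i_yz))

# Eq. 30: rank each remaining edge xy by its best contributor z,
#   R(xy; z|{u_i}) = max_z s_lb(...),
# then grow the separation set of the top-ranked edge: {u_i} <- {u_i} + z
```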

Robust inference of conditional independencies using the 3off2 scheme

The previous results can be used to provide a robust inference method to identify conditional independencies and, hence, reconstruct the skeleton of underlying causal graphs from finite available observational data. The approach follows the spirit of constraint-based methods, such as the PC or IC algorithms, but recovers conditional independencies following an evolving ranking of the network edges, $R(xy;z|\{u_i\})$, defined in Eq. 30.

All in all, this amounts to performing a generic decomposition of each mutual information term, I(x;y), by introducing a succession of node candidates, $u_1, u_2, \ldots, u_n$, that are likely to contribute to the overall mutual information between the pair x and y, as,

$$I(x;y) = I(x;y;u_1) + I(x;y|u_1) = I(x;y;u_1) + I(x;y;u_2|u_1) + \cdots + I(x;y;u_n|\{u_i\}_{n-1}) + I(x;y|\{u_i\}_n) \tag{31}$$

or equivalently in terms of the shifted 2-point and 3-point information terms including finite size corrections (Eqs. 21 & 22),

$$I'(x;y) = I'(x;y;u_1) + I'(x;y;u_2|u_1) + \cdots + I'(x;y;u_n|\{u_i\}_{n-1}) + I'(x;y|\{u_i\}_n) \tag{32}$$

Hence, given a significant mutual information between x and y, $I'(x;y)>0$, we will search for possible structural independencies, i.e. $I'(x;y|\{u_i\}_n)<0$, by iteratively "taking off" conditional 3-point information terms from the initial 2-point (mutual) information, $I'(x;y)$, as

$$I'(x;y|\{u_i\}_n) = I'(x;y) - I'(x;y;u_1) - I'(x;y;u_2|u_1) - \cdots - I'(x;y;u_n|\{u_i\}_{n-1}) \tag{33}$$

and similarly with non-shifted 2-point and 3-point information,

$$I(x;y|\{u_i\}_n) = I(x;y) - I(x;y;u_1) - I(x;y;u_2|u_1) - \cdots - I(x;y;u_n|\{u_i\}_{n-1}) \tag{34}$$
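A schematic Python rendering of this iteration follows (all helper names are hypothetical: `shifted_i2(x, y, us)` stands for Eq. 21 and `best_contributor(x, y, us)` for the ranking of Eqs. 29-30, returning the most likely contributing node and its score, or `(None, 0.0)` when no candidate remains):

```python
def off2_skeleton(edges, shifted_i2, best_contributor):
    """Sketch of 3off2 step 1: iteratively 'take off' 3-point terms.

    edges: iterable of candidate pairs (x, y) from the complete graph.
    Returns the retained skeleton and the separation set of each pair.
    """
    edges = list(edges)
    sep = {e: [] for e in edges}                  # {u_i} per candidate edge
    while True:
        best = None
        for (x, y) in edges:
            if shifted_i2(x, y, sep[(x, y)]) <= 0:
                continue                          # independency already found
            z, score = best_contributor(x, y, sep[(x, y)])
            if z is not None and (best is None or score > best[0]):
                best = (score, (x, y), z)
        if best is None:
            break                                 # no significant contribution left
        _, edge, z = best
        sep[edge].append(z)                       # {u_i} <- {u_i} + z   (Eq. 33)
    # keep only the edges whose shifted information remains positive
    skeleton = [e for e in edges if shifted_i2(e[0], e[1], sep[e]) > 0]
    return skeleton, sep
```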

3off2 algorithm

The 3off2 scheme can be used to devise a two-step algorithm (see Algorithm 2), inspired by constraint-based approaches, which first reconstructs the network skeleton (Algorithm 2, step 1) before combining orientation and propagation of edges in a single step based on likelihood ratios (Algorithm 2, step 2).

Reconstruction of network skeleton

The 3off2 scheme will first be applied to iteratively remove the edges with maximum positive contributions, $I'(x;y;u_k|\{u_i\}_{k-1})>0$, corresponding to the most likely generalized non-v-structures (Eq. 23), while simultaneously minimizing the remaining 2-point information, $I'(x;y|\{u_i\}_k)$ (Eq. 24), consistently with the data processing inequality. This 3off2 scheme (Algorithm 2, step 1) will therefore progressively lower the conditional 2-point information terms, $I'(x;y) > \cdots > I'(x;y|\{u_i\}_{k-1}) > I'(x;y|\{u_i\}_k)$, and might ultimately result in the removal of the corresponding edge, xy, but only when a structural independency is actually found, i.e. $I'(x;y|\{u_i\}_n)<0$, as in constraint-based algorithms for a given significance level α. Yet, the skeleton obtained with the 3off2 scoring approach is expected to be more robust to finite observational data than the skeletons obtained with the PC or IC algorithms, as the former results only from statistically significant 3-point contributions, $I'(x;y;u_k|\{u_i\}_{k-1})>0$, taken in the order of their quantitative 3off2 ranks, $R(xy;u_k|\{u_i\}_{k-1})$.

The best results on benchmark networks using these quantitative 3off2 ranks are obtained with the NML score (see Results and discussion section below). The MDL score leads to equivalent results, as expected, in the limit of very large datasets (see Appendix). However, with smaller datasets, the most reliable results with the MDL score are obtained using non-shifted instead of shifted 2-point and 3-point information terms in the 3off2 rank of individual edges, Eq. 30. This is because the MDL complexity tends to underestimate the importance of edges between nodes with many levels (see Appendix). For finite datasets, it easily leads to spurious conditional independencies, $I'(x;y|\{u_i\})<0$, when using shifted 2-point and 3-point information, Eq. 33, whereas using non-shifted information in the 3off2 ranks (Eq. 30) tends to limit the number of false negatives, as early errors in $\{u_i\}$ can only increase the final $I(x;y|\{u_i\}) \geq 0$ in Eq. 34.

Orientation of network skeleton

The skeleton and the separation sets resulting from the 3off2 iteration step (Algorithm 2, step 1) can then be used to orient edges and propagate orientations to the unshielded triples. However, while the constraint-based methods distinguish the v-structure orientation step (Algorithm 1, step 2) from the propagation procedure (Algorithm 1, step 3), the 3off2 algorithm intertwines these two steps based on the respective likelihood scores of individual v-structures and non-v-structures (Algorithm 2, step 2).

As stated earlier, the magnitude and sign of the conditional 3-point information, $I(x;y;z|\{u_i\})$ (or equivalently the shifted 3-point information, Eq. 23), indicate whether a non-v-structure is more likely than a v-structure. Hence, all the unshielded triples can be ranked by the absolute value of their conditional 3-point information, that is, in decreasing order of their likelihood of being either a v-structure or a non-v-structure. As detailed in step 2 of Algorithm 2, the most likely v-structure is used to set the first orientations, following the R0 orientation rule. The possible propagations are then performed, following the R1 propagation rule, starting from the unshielded triple with the most positive conditional 3-point information. The next most likely v-structure is considered when no further propagation is possible on unshielded triples with greater absolute 3-point information. If conflicting orientations arise (such as $a\to b\leftarrow c$ & $b\to c\leftarrow d$), the less likely v-structure and its possible propagations are ignored.

Note that we only implement the R0 orientation and R1 propagation rules, which are applied in decreasing order of likelihood. In particular, we do not consider the propagation rules R2 and R3, which are not associated with likelihood scores but merely enforce the acyclicity constraint.

As for the 3off2 skeleton reconstruction, the orientation/propagation step of 3off2 allows for a robust discovery of orientations from finite observational data, as it relies on a quantitative framework of likelihood ratios taken in decreasing order of their statistical significance. During this step, 3off2 recovers and propagates as many orientations as possible in an iterative procedure following the decreasing ranks of the unshielded triples based on the absolute value of their conditional 3-point information, $|I'(x;y;z|\{u_i\})|$.
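A compact sketch of this intertwined orientation/propagation step is given below; it is a simplified rendering, not the full Algorithm 2, and assumes each unshielded triple x−z−y comes with its precomputed shifted 3-point information $I'(x;y;z|\{u_i\})$:

```python
def orient_skeleton(triples):
    """Sketch of 3off2 step 2: orient v-structures (R0) and propagate (R1).

    triples: list of (i3p, x, z, y) for unshielded triples x - z - y,
    with i3p = I'(x;y;z|{u_i}).  Triples are visited in decreasing order
    of |i3p|, i.e. decreasing reliability of the (non-)v-structure call.
    """
    oriented = set()                              # directed edges (a, b): a -> b
    for i3p, x, z, y in sorted(triples, key=lambda t: -abs(t[0])):
        if i3p < 0:                               # likely v-structure: x -> z <- y  (R0)
            if (z, x) in oriented or (z, y) in oriented:
                continue                          # conflict: drop the less likely call
            oriented.add((x, z))
            oriented.add((y, z))
        else:                                     # likely non-v-structure
            if (x, z) in oriented and (y, z) not in oriented:
                oriented.add((z, y))              # R1 propagation: z -> y
            elif (y, z) in oriented and (x, z) not in oriented:
                oriented.add((z, x))              # R1 propagation: z -> x
    return oriented
```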

Results and discussion

Tests on benchmark graphs

We have tested the 3off2 network reconstruction approach on benchmark causal graphs containing 20 to 70 nodes, Figs. 2, 3, 4, 5 and 6. The results are evaluated against other methods in terms of Precision (or positive predictive value), $Prec = TP/(TP+FP)$, Recall or Sensitivity (true positive rate), $Rec = TP/(TP+FN)$, as well as F-score $= 2\times Prec\times Rec/(Prec+Rec)$, for increasing sample sizes from N=10 to 50,000 data points.

Fig. 2 CHILD network. [20 nodes, 25 links, 230 parameters, Average degree 2.5, Maximum in-degree 2]. Precision, Recall and F-score for skeletons (dashed lines) and CPDAGs (solid lines). The results are given for Aracne (black), PC (blue), Bayesian Hill-Climbing (green) and 3off2 (red)

Fig. 3 ALARM network. [37 nodes, 46 links, 509 parameters, Average degree 2.49, Maximum in-degree 4]. Precision, Recall and F-score for skeletons (dashed lines) and CPDAGs (solid lines). The results are given for Aracne (black), PC (blue), Bayesian Hill-Climbing (green) and 3off2 (red)

Fig. 4 INSURANCE network. [27 nodes, 52 links, 984 parameters, Average degree 3.85, Maximum in-degree 3]. Precision, Recall and F-score for skeletons (dashed lines) and CPDAGs (solid lines). The results are given for Aracne (black), PC (blue), Bayesian Hill-Climbing (green) and 3off2 (red)

Fig. 5 BARLEY network. [48 nodes, 84 links, 114,005 parameters, Average degree 3.5, Maximum in-degree 4]. Precision, Recall and F-score for skeletons (dashed lines) and CPDAGs (solid lines). The results are given for Aracne (black), PC (blue), Bayesian Hill-Climbing (green) and 3off2 (red)

Fig. 6 HEPAR II network. [70 nodes, 123 links, 1,453 parameters, Average degree 3.51, Maximum in-degree 6]. Precision, Recall and F-score for skeletons (dashed lines) and CPDAGs (solid lines). The results are given for Aracne (black), PC (blue), Bayesian Hill-Climbing (green) and 3off2 (red)

We also define additional Precision, Recall and F-scores taking into account the edge orientations of the predicted networks against the corresponding CPDAG of the benchmark networks. This amounts to labelling as false positives all true positive edges of the skeleton whose orientation/non-orientation status differs from the CPDAG reference, $TP_{\mathrm{misorient}}$, leading to the orientation-dependent definitions $TP' = TP - TP_{\mathrm{misorient}}$ and $FP' = FP + TP_{\mathrm{misorient}}$, with the corresponding CPDAG Precision, Recall and F-scores taking edge orientations into account.
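One plausible implementation of these orientation-aware scores is sketched below (under the reading that FN is left unchanged; edge sets are assumed to be given as dicts mapping an unordered pair to its orientation status in the CPDAG):

```python
def cpdag_scores(pred, ref):
    """Precision, Recall and F-score counting mis-oriented true edges as FP.

    pred, ref: dict mapping frozenset({a, b}) to an orientation status,
    e.g. 'a->b', 'b->a' or '--' for an undirected CPDAG edge.
    """
    tp_skel = set(pred) & set(ref)
    misorient = {e for e in tp_skel if pred[e] != ref[e]}
    tp = len(tp_skel) - len(misorient)               # TP' = TP - TP_misorient
    fp = len(set(pred) - set(ref)) + len(misorient)  # FP' = FP + TP_misorient
    fn = len(set(ref) - set(pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    fscore = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, fscore
```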

The alternative inference methods used for comparison with 3off2 are the PC algorithm [12] implemented in the pcalg package [25], [26] and Bayesian inference using the hill-climbing heuristics implemented in the bnlearn package [27]. In addition, we also compare the skeleton of 3off2 to the unoriented output of Aracne [28], an information-based inference approach, which iteratively prunes links with the weakest mutual information based on the Data Processing Inequality. We have used the Aracne implementation of the minet package [29]. For each sample size, 3off2, Aracne, PC and the Bayesian inference methods have been tested on 50 replicates. Figures 2, 3, 4, 5 and 6 give the average results over these multiple replicates when comparing the CPDAG (solid lines) of the reconstructed network (or its skeleton, dashed lines) to the CPDAG (or the skeleton) of the benchmark network.

For each method, the plots presented in Figs. 2, 3, 4, 5 and 6 are those obtained with the parameters that give the overall best results over the five reconstructed benchmark networks (see Additional file 1, Figures S1-S20). In particular, we used the stable implementation of the PC algorithm, as well as the majority rule for the orientation and propagation steps [14]. PC's results are shown in Figs. 2, 3, 4, 5 and 6 for α=0.1. Decreasing α tends to improve the skeleton Precision at the expense of the skeleton Recall, leading in fact to worse skeleton F-scores for finite datasets, e.g. N≤1000 (see Additional file 1, Figures S1-S5). The same trend is observed for CPDAG F-scores taking into account edge orientations, with the best CPDAG scores at small sample sizes, e.g. N≤1000, obtained for larger α. Aracne's threshold parameter for the minimum difference in mutual information is set to ε=0, as small positive values typically worsen F-scores (see Additional file 1, Figures S6-S10). Bayesian inference results are obtained using BIC/MDL scores and hill-climbing heuristics with 100 random restarts [9] (see Additional file 1, Figures S11-S15). Finally, the best 3off2 network reconstructions are obtained using NML scores with shifted 2-point and 3-point information terms in the rank of individual edges, see Methods. Using MDL scores instead leads to equivalent results, as expected, in the limit of very large datasets (see Appendix). However, with smaller datasets, the most reliable results with MDL scores are obtained using non-shifted instead of shifted 2-point and 3-point information terms in the 3off2 rank of individual edges, as discussed in Methods (see Additional file 1, Figures S16-S20).

All in all, we found that the 3off2 inference approach typically reaches better or equivalent F-scores for all dataset sizes compared to all other tested methods, i.e. Aracne, PC and Bayesian inference, as well as the Max-Min Hill-Climbing (MMHC) hybrid method [30] (see Additional file 1, Figures S21-S25). This is clearly observed for the skeletons (Figs. 2, 3, 4, 5 and 6, dashed lines) and even more clearly when taking the predicted orientations into account (Figs. 2, 3, 4, 5 and 6, solid lines).

Applications to the hematopoiesis regulation network

The reconstruction or reverse-engineering of real regulatory networks from actual expression data has already been performed on a number of biological systems (see e.g. [28], [31]–[33]). Here, we apply the 3off2 approach to a real biological dataset related to hematopoiesis. Transcription factors play a central role in hematopoiesis, from which the blood cell lineages derive. As suggested in previous studies, changes in the regulatory interactions among transcription factors [34] or their overexpression [35] might be involved in the development of T-acute lymphoblastic leukaemia (T-ALL). The key role of hematopoiesis and the potentially serious consequences of its dysregulation emphasize the need to accurately establish the complex interactions between the transcription factors involved in this critical biological process.

The dataset we have used for this analysis [36] consists of the single cell expressions of 18 transcription factors known for their role in hematopoiesis. Five hundred and ninety-seven single cells, representing 5 different types of hematopoietic progenitors, were included in the analysis (N=597). We reconstructed the corresponding network with the 3off2 inference method, Fig. 7, and four other available approaches, namely, PC [12] implemented in the pcalg package [25], [26], Bayesian inference using hill-climbing heuristics as well as the Max-Min Hill-Climbing (MMHC) hybrid method [30], both implemented in the bnlearn package [27], and, finally, Aracne [28] implemented in the minet package [29] (Table 1 and Additional file 1: Table S1).

Fig. 7 Hematopoietic subnetwork reconstructed by 3off2. The dataset [36] concerns 18 transcription factors, 597 single cells, and 5 different hematopoietic progenitor types. Red and blue edges correspond to experimentally proven activations and repressions, respectively, as reported in the literature (Table 1), while grey links indicate regulatory interactions for which no clear evidence has been established so far. Thinner arrows underline 3off2 misorientations

Table 1 Interactions reconstructed by 3off2 and alternative methods for a subnetwork of hematopoiesis regulation. → indicates a successfully recovered interaction, including its direction, as reported in the literature (see References). ← corresponds to a successfully recovered interaction, however, with the opposite direction to that reported in the literature. ⌿ stipulates that no direct regulatory interaction has been inferred, while — corresponds to an undirected link. Note in particular that Aracne does not infer edge directions. See Additional file 1: Table S1 for supplementary statistics

3off2 uncovers all 11 interactions for which specific experimental evidence has been reported in the literature (Fig. 7, red links: known activations; blue links: known repressions) as well as 30 additional links (Fig. 7, grey links: unknown regulatory interactions). By contrast, randomization of the actual data across samples for each TF leads to only 5.25 spurious interactions on average between the 18 TFs, instead of the 41 edges inferred from the actual data, and 1.62 spurious interactions on average, instead of the 16 interactions predicted among the 10 TFs involved in known regulatory interactions, Fig. 7. This suggests that around 10–13 % of the predicted edges might be spurious, due to the inevitable sampling noise in the finite dataset. In particular, the 3off2 inference approach successfully recovers the relationships of the regulatory triad between Gata2, Gfi1b and Gfi1 described in [36] and reports correct orientations for the edges involving Gata2 (Gfi1b and Gfi1 in fact cross-regulate one another [36], Table 1). The network reconstructed by 3off2 also correctly infers the regulations of PU.1 by Gfi1 [37], Gfi1 by Lyl1 [38], Meis1 by Ldb1 [39], and the regulations of Lyl1 by Ldb1 [39] and Erg [40]. Finally, the interactions Gata2–SCL [40], Gfi1b–Meis1 [41] and Gata1–Gata2 [42] are correctly inferred, however, with opposite directions to those reported in the literature. Yet, overall, 3off2 outperforms most of the other methods tested for the reconstruction of the hematopoietic regulatory subnetwork (Table 1 and Additional file 1: Table S1). Only the Bayesian hill-climbing method using a BDe score leads to comparable results, by retrieving 10 out of 11 interactions and correctly orienting 8 of them. These encouraging results of the 3off2 reconstruction method on experimentally proven regulatory interactions (red edges in Fig. 7) could motivate further investigation of novel regulatory interactions awaiting to be tested for their possible role in hematopoiesis (e.g. grey edges in Fig. 7).

Conclusions

In this paper, we propose to improve constraint-based network reconstruction methods by identifying structural independencies through a robust quantitative score-based scheme limiting the accumulation of early false negative (FN) errors and subsequent compensatory false positive (FP) errors. In brief, 3off2 relies on information theoretic scores to progressively uncover the best supported conditional independencies, by iteratively "taking off" the most likely indirect contributions of conditional 3-point information from every 2-point (mutual) information of the causal graph.

Earlier hybrid methods have also attempted to improve network reconstruction by combining the concepts of constraint-based approaches with the robustness of Bayesian scores [30], [43]–[45]. In particular, the authors of [43] proposed to exploit an intrinsic weakness of the PC algorithm, its sensitivity to the order in which conditional independencies are tested on finite data, to rank these different order-dependent PC predictions with Bayesian scores. More recently, the authors of [30] also combined constraint-based and Bayesian approaches by first identifying both parents and children of each node of the underlying graphical model and then performing a greedy Bayesian hill-climbing search restricted to the identified parents and children of each node. This Max-Min Hill-Climbing (MMHC) approach tends to have a high precision in terms of skeleton but a more limited sensitivity, leading overall to lower skeleton and CPDAG F-scores than 3off2 and Bayesian hill-climbing methods on the same benchmark networks, Figures S21-S25. Interestingly, however, the MMHC approach is among the fastest network reconstruction approaches, Figure S26, allowing for scalability to large network sizes [30].

The 3off2 algorithm is expected to run in polynomial time on typical sparse causal networks with low in-degree, just like constraint-based algorithms. However, in practice and despite the additional computation of conditional 2-point and 3-point information terms, we found that the 3off2 algorithm typically runs faster than constraint-based algorithms for large enough samples, by avoiding the cascading accumulation of errors that inflates the combinatorial search of conditional independencies in traditional constraint-based approaches. Instead, we found that the 3off2 running time displays a similar trend to Bayesian hill-climbing heuristic methods, Figs. 2, 3, 4, 5 and 6.

All in all, the main computational bottleneck of the present 3off2 scheme pertains to the identification of the best contributing nodes at each iteration. In the future, it could be interesting to investigate whether a more stochastic version of this 3off2 method, based on choosing one significant conditional 3-point information instead of the best one, might simultaneously accelerate the network reconstruction and circumvent possible locally trapped suboptimal predictions through stochastic resampling.

Finally, another perspective for practical applications will be to include the possibility of latent variables and bidirected edges in reconstructed networks.

Appendix

Complexity of graphical models

The complexity $k_{\mathcal{G},\mathcal{D}}$ of a graphical model is related to the normalization constant $Z(\mathcal{G},\mathcal{D})$ of its maximum likelihood as $k_{\mathcal{G},\mathcal{D}} = \log Z(\mathcal{G},\mathcal{D})$,

$$\mathcal{L}_{\mathcal{G}} = \frac{e^{-NH(\mathcal{G},\mathcal{D})}}{Z(\mathcal{G},\mathcal{D})} = e^{-NH(\mathcal{G},\mathcal{D}) - k_{\mathcal{G},\mathcal{D}}} \tag{35}$$

For Bayesian networks with decomposable entropy, i.e. $H(\mathcal{G},\mathcal{D}) = \sum_i H(x_i|\{\mathrm{Pa}_{x_i}\})$, it is convenient to use decomposable complexities, $k_{\mathcal{G},\mathcal{D}} = \sum_i k_{x_i|\{\mathrm{Pa}_{x_i}\}}$,

$$\mathcal{L}_{\mathcal{G}} = e^{-N\sum_i H(x_i|\{\mathrm{Pa}_{x_i}\}) - \sum_i k_{x_i|\{\mathrm{Pa}_{x_i}\}}} \tag{36}$$

such that the comparison between alternative models $\mathcal{G}$ and $\mathcal{G}_{\setminus xy}$ (i.e. $\mathcal{G}$ with one missing edge $x\to y$) leads to a simple local increment of the score,

$$\frac{\mathcal{L}_{\mathcal{G}_{\setminus xy}}}{\mathcal{L}_{\mathcal{G}}} = e^{-NI(x;y|\{\mathrm{Pa}_y\}_{\setminus x}) + \Delta k_{y|\{\mathrm{Pa}_y\}_{\setminus x}}} \tag{37}$$
$$I(x;y|\{\mathrm{Pa}_y\}_{\setminus x}) = H(y|\{\mathrm{Pa}_y\}_{\setminus x}) - H(y|\{\mathrm{Pa}_y\}) \geq 0 \tag{38}$$
$$\Delta k_{y|\{\mathrm{Pa}_y\}_{\setminus x}} = k_{y|\{\mathrm{Pa}_y\}} - k_{y|\{\mathrm{Pa}_y\}_{\setminus x}} \geq 0 \tag{39}$$

A common complexity criterion in model selection is the Bayesian Information Criterion (BIC) or Minimal Description Length (MDL) criterion [19], [20],

$$k^{\mathrm{MDL}}_{y|\{\mathrm{Pa}_y\}} = \frac{1}{2}(r_y-1)\prod_{j\in\{\mathrm{Pa}_y\}} r_j\,\log N \tag{40}$$
$$\Delta k^{\mathrm{MDL}}_{y|\{\mathrm{Pa}_y\}_{\setminus x}} = \frac{1}{2}(r_x-1)(r_y-1)\prod_{j\in\{\mathrm{Pa}_y\}_{\setminus x}} r_j\,\log N \tag{41}$$

where $r_x$, $r_y$ and $r_j$ are the numbers of levels of each variable, x, y and j. The MDL complexity, Eq. 40, is simply related to the normalisation constant reached in the asymptotic limit of a large dataset, $N\to\infty$ (Laplace approximation). The MDL complexity can also be derived from the Stirling approximation of the Bayesian measure [46], [47]. Yet, in practice, this limit distribution is only reached for very large datasets, as some of the least likely $(r_y-1)\prod_j r_j$ combinations of states of the variables are in fact rarely (if ever) sampled in typical finite datasets. As a result, the MDL complexity criterion tends to underestimate the relevance of edges connecting variables with many levels, $r_i$, leading to the removal of edges, i.e. false negatives.

To avoid such biases with finite datasets, the normalisation of the maximum likelihood can be done over all possible datasets with the same number N of data points. This corresponds to the (universal) Normalized Maximum Likelihood (NML) criterion [21]–[24],

$$\mathcal{L}_{\mathcal{G}} = \frac{e^{-NH(\mathcal{G},\mathcal{D})}}{\sum_{|\mathcal{D}'|=N} e^{-NH(\mathcal{G},\mathcal{D}')}} = e^{-NH(\mathcal{G},\mathcal{D}) - k^{\mathrm{NML}}_{\mathcal{G},\mathcal{D}}} \tag{42}$$

We introduce here the factorized version of the NML criterion [23], [24], which corresponds to a decomposable NML score, $k^{\mathrm{NML}}_{\mathcal{G},\mathcal{D}} = \sum_{x_i} k^{\mathrm{NML}}_{x_i|\{\mathrm{Pa}_{x_i}\}}$, defined as,

$$k^{\mathrm{NML}}_{y|\{\mathrm{Pa}_y\}} = \sum_j^{q_y} \log \mathcal{C}^{r_y}_{N_{yj}} \tag{43}$$
$$\Delta k^{\mathrm{NML}}_{y|\{\mathrm{Pa}_y\}_{\setminus x}} = \sum_j^{q_y} \log \mathcal{C}^{r_y}_{N_{yj}} - \sum_{j'}^{q_y/r_x} \log \mathcal{C}^{r_y}_{N_{yj'}} \tag{44}$$

where $N_{yj}$ is the number of data points corresponding to the $j$th state of the parents $\{\mathrm{Pa}_y\}$ of y, and $N_{yj'}$ the number of data points corresponding to the $j'$th state of the parents of y excluding x, $\{\mathrm{Pa}_y\}_{\setminus x}$. Hence, the factorized NML score for each node $x_i$ corresponds to a separate normalisation for each state $j=1,\ldots,q_i$ of its parents, involving exactly $N_{ij}$ data points of the finite dataset,

$$\mathcal{L}_{\mathcal{G}} = e^{-N\sum_i H(x_i|\{\mathrm{Pa}_{x_i}\}) - \sum_i\sum_j^{q_i}\log\mathcal{C}^{r_i}_{N_{ij}}} \tag{45}$$
$$= e^{N\sum_i\sum_j^{q_i}\sum_k^{r_i}\frac{N_{ijk}}{N}\log\frac{N_{ijk}}{N_{ij}} - \sum_i\sum_j^{q_i}\log\mathcal{C}^{r_i}_{N_{ij}}} \tag{46}$$
$$= \prod_i\prod_j^{q_i}\frac{\prod_k^{r_i}\left(\frac{N_{ijk}}{N_{ij}}\right)^{N_{ijk}}}{\mathcal{C}^{r_i}_{N_{ij}}} \tag{47}$$

where $N_{ijk}$ corresponds to the number of data points for which the $i$th node is in its $k$th state and its parents in their $j$th state, with $N_{ij} = \sum_k^{r_i} N_{ijk}$. The universal normalization constant $\mathcal{C}^r_n$ is then obtained by summing over all possible partitions of the n data points into a maximum of r subsets, $\ell_1+\ell_2+\cdots+\ell_r = n$ with $\ell_k\geq 0$,

$$\mathcal{C}^r_n = \sum_{\ell_1+\ell_2+\cdots+\ell_r = n} \frac{n!}{\ell_1!\,\ell_2!\cdots\ell_r!} \prod_{k=1}^r \left(\frac{\ell_k}{n}\right)^{\ell_k} \tag{48}$$

which can in fact be computed in linear time using the following recursion [23],

$$\mathcal{C}^r_n = \mathcal{C}^{r-1}_n + \frac{n}{r-2}\,\mathcal{C}^{r-2}_n \tag{49}$$

with $\mathcal{C}^r_0 = 1$ for all r, $\mathcal{C}^1_n = 1$ for all n, and applying the general formula, Eq. 48, for r=2,

$$\mathcal{C}^2_n = \sum_{h=0}^n \binom{n}{h}\left(\frac{h}{n}\right)^h\left(\frac{n-h}{n}\right)^{n-h} \tag{50}$$

or its Szpankowski approximation for large n (needed for n>1000 in practice) [48]–[50],

$$\mathcal{C}^2_n = \sqrt{\frac{n\pi}{2}}\left(1 + \frac{2}{3}\sqrt{\frac{2}{n\pi}} + \frac{1}{12n} + O\!\left(\frac{1}{n^{3/2}}\right)\right) \tag{51}$$
$$\simeq \sqrt{\frac{n\pi}{2}}\,\exp\left(\sqrt{\frac{8}{9n\pi}} + \frac{3\pi-16}{36n\pi}\right) \tag{52}$$

Then, following the rationale of constraint-based approaches, we can reformulate the likelihood ratio of Eq. 37 by replacing the parent nodes $\{\mathrm{Pa}_y\}_{\setminus x}$ in the conditional mutual information, $I(x;y|\{\mathrm{Pa}_y\}_{\setminus x})$, with an unknown separation set $\{u_i\}$ to be learnt simultaneously with the missing edge candidate $x\to y$,

$$\frac{\mathcal{L}_{\mathcal{G}_{\setminus xy|\{u_i\}}}}{\mathcal{L}_{\mathcal{G}}} = e^{-NI(x;y|\{u_i\}) + k_{x;y|\{u_i\}}} \tag{53}$$

where we have also transformed the asymmetric parent-dependent complexity difference, $\Delta k_{y|\{\mathrm{Pa}_y\}_{\setminus x}}$, into a $\{u_i\}$-dependent complexity term, $k_{x;y|\{u_i\}}$, with the same xy-symmetry as $I(x;y|\{u_i\})$,

$$k^{\mathrm{MDL}}_{x;y|\{u_i\}} = \frac{1}{2}(r_x-1)(r_y-1)\prod_i r_{u_i}\,\log N \tag{54}$$
$$k^{\mathrm{NML}}_{x;y|\{u_i\}} = \frac{1}{2}\sum_j^{\{u_i\}}\left[\sum_{k_x}^{r_x}\log\mathcal{C}^{r_y}_{N_{k_xj}} - \log\mathcal{C}^{r_y}_{N_j} + \sum_{k_y}^{r_y}\log\mathcal{C}^{r_x}_{N_{k_yj}} - \log\mathcal{C}^{r_x}_{N_j}\right] \tag{55}$$

Note, in particular, that the MDL complexity term in Eq. 54 is readily obtained from Eq. 41 due to the Markov equivalence of the MDL score, corresponding to its xy-symmetry whenever $\{\mathrm{Pa}_y\}_{\setminus x} = \{\mathrm{Pa}_x\}_{\setminus y}$. By contrast, the factorized NML score, Eq. 43, is not a Markov-equivalent score (although its non-factorized version, Eq. 42, is Markov equivalent by definition). To circumvent this non-equivalence of the factorized NML score, we propose to recover the expected xy-symmetry of $k^{\mathrm{NML}}_{x;y|\{u_i\}}$ through the simple xy-symmetrization of Eq. 44, leading to Eq. 55.

Publication costs

Publication costs for this article were funded by the Région Ile-de-France.

Declarations

This article has been published as part of BMC Bioinformatics Volume 17 Supplement 2, 2016: Bringing Maths to Life (BMTL). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements.
