## Abstract

### Background

The discovery of molecular pathways is a challenging problem and its solution relies on the identification of causal molecular interactions in genomics data. Causal molecular interactions can be discovered using randomized experiments; however such experiments are often costly, infeasible, or unethical. Fortunately, algorithms that infer causal interactions from observational data have been in development for decades, predominantly in the quantitative sciences, and many of them have recently been applied to genomics data. While these algorithms can infer unoriented causal interactions between involved molecular variables (i.e., without specifying which one is the cause and which one is the effect), causally orienting all inferred molecular interactions was assumed to be an unsolvable problem until recently. In this work, we use transcription factor-target gene regulatory interactions in three different organisms to evaluate a new family of methods that, given observational data for just two causally related variables, can determine which one is the cause and which one is the effect.

### Results

We have found that a particular family of causal orientation methods (IGCI Gaussian) is often able to accurately infer directionality of causal interactions, and that these methods usually outperform other causal orientation techniques. We also introduced a novel ensemble technique for causal orientation that combines decisions of individual causal orientation methods. The ensemble method was found to be more accurate than any best individual causal orientation method in the tested data.

### Conclusions

This work represents a first step towards establishing context for practical use of causal orientation methods in the genomics domain. We have found that some causal orientation methodologies yield accurate predictions of causal orientation in genomics data, and we have improved on this capability with a novel ensemble method. Our results suggest that these methods have the potential to facilitate reconstruction of molecular pathways by minimizing the number of required randomized experiments to find causal directionality and by avoiding experiments that are infeasible and/or unethical.

### Similar content being viewed by others

## Background

The discovery of molecular pathways that drive diseases and vital cellular functions is a fundamental activity of biomedical research. Unraveling disease pathways is a major component in the efforts to develop new therapies that will effectively fight deadly diseases. Furthermore, knowing pathways significantly facilitates the design of personalized medicine modalities for diagnosis, prognosis, and management of diseases. The discovery of pathways is a challenging problem and its solution to a large extent relies on the identification of *causal* molecular interactions in genomics data.

By causal molecular interactions or relations we mean interactions of molecular variables that match the notion of randomized controlled experiment, which is the de facto standard for assessing causation in the general sciences and biomedicine [1–5]. Assume that a hypothetical experimenter can change the distribution of a variable X (i.e., experimentally manipulate it). We say that X is a cause of Y (and Y is an effect of X) and denote this by X→Y if the probability distribution of Y changes for some experimental manipulation of X.

Causal molecular interactions can be discovered using randomized experiments such as interference with RNA (e.g., shRNA, siRNA); however such experiments are often costly, infeasible, or unethical. Fortunately, over the last 20 years many algorithms that infer causal interactions from *observational* data have been developed [1–5] and some of them have been adopted to the high dimensionalities of modern genomics data [6, 7]. Outside of biomedicine, two Nobel prizes have recently been awarded in 2003 and 2011 for methods which seek to discover causal relations from non-experimental data [8–11].

In our prior work we evaluated the ability of state-of-the-art causal discovery algorithms to de-novo identify *unoriented* edges in genome-scale regulatory networks [12], which represent causal interactions between transcription factors and their target genes without distinguishing the mechanistic role of the involved molecular variables (i.e., we did not assess which genes were transcription factors and which genes were their targets). We deliberately avoided performing causal orientation of the discovered unoriented edges (i.e., separating transcription factors/causes from their target genes/effects) because this problem has previously been deemed worst-case unsolvable in observational data using existing algorithms [1] due to the statistical indistinguishability of causal models in the same Markov equivalence class. For example, causal models X→Y and X←Y have generally been assumed to be statistically indistinguishable given only observational data for X and Y.

Over the last 5 years researchers have developed a new class of methods that, given observational data for just two causally related variables X and Y, aim to determine which variable is the cause and which one is the effect (e.g., separate X→Y from X←Y) [13–18]. These causal orientation methods aim to solve the problem of statistical indistinguishability of graphs in the Markov equivalence class by *exploiting asymmetries in the shapes of the conditional probability densities* and *without requiring randomized experiments*. These methods could have significant implications for the field of causal discovery because they can orient unoriented edges that are discoverable by other established techniques, e.g. Generalized Local Learning (GLL) or Local-to-Global Learning (LGL) [6, 7]. Therefore, at face value, these causal orientation methods have the potential to reduce the number of, and in some cases even eliminate, randomized experiments needed for causal orientation of edges in the Markov equivalence class of graphs and make the causal model fully identifiable from observational data alone.

As promising as these new causal orientation methods are, they have not been previously applied in genomics, where the data is usually noisy and the sample sizes are relatively small compared to prior test applications of these methods [13–18]. In this paper we report results of an extensive study of recent causal orientation techniques in the genomics domain by (i) testing 12 methods/variants to distinguish cause (in our experiments, transcription factor) from effect (in our experiments, target gene) in 5,739 causal interactions and (ii) conducting sensitivity analyses with respect to noise and sample size for the best-performing methods. In addition, (iii) we introduce a new ensemble technique for causal orientation that is shown to be more accurate than any best individual causal orientation method in the tested data. The results of this study can serve as a foundation for further development of causal orientation techniques for genomics data and establishing a context for wide applications in molecular or biomedical research.

## Methods

### Causal orientation methods

As mentioned above, the purpose of the tested causal orientation methods is to separate cause from effect given data for just two variables X and Y *that have a causal relation* (i.e., in the underlying data generative distribution, either X → Y or X ← Y). Typically these methods are not designed to be used to causally orient pairs of variables that only have univariate association/correlation due to the possibility of confounding. For example, in the majority of distributions, a causal structure X ← T → Y implies that X and Y are associated even though they are not causally affecting each other. Therefore, the presence of association in a pair of variables X and Y is, in general, a necessary but not sufficient condition to be eligible for causal orientation. A rigorous approach would involve first using correct causal discovery methods (e.g., GLL/LGL [6, 7], MMHC [19], PC/FCI [1], IC/IC* [2], etc.) to identify unoriented edges that denote the existence of unconfounded causal relations and then applying causal orientation techniques to orient unoriented edges.

While each causal orientation method has its own principles and sufficient assumptions that are outlined in Table 1 (and brief descriptions of the algorithms are given in the Additional file 1), most of these techniques are based on the idea that the factorization of the joint probability distribution P(cause,effect) into P(cause)P(effect|cause) yields a simpler representation than the factorization into P(effect)P(cause|effect). One can furthermore show that if the marginal probability distribution of the cause P(cause) is independent of the causal mechanism P(effect|cause), then the factorization P(cause)P(effect|cause) has lower complexity than the factorization P(effect)P(cause|effect). Given two causally related variables X and Y, estimating the complexity of the two different factorizations of P(X,Y) or determining independence between marginal and conditional distributions can thus provide the basis for causal orientation techniques. In practice, however, it is difficult to directly test independence between P(X) and P(Y|X) or estimate (or even define a measure of) their complexity; hence the methods typically use simplifying assumptions or rely on approximate formulations.

The majority of tested causal orientation methods (IGCI, LINGAM, GPI-MML, ANM-MML, ANM-GAUSS) output two scores indicating likelihood of the forward causal model (X → Y) and the backward one (X ← Y). Other tested methods (ANM, PNL, GPI) output two p-values indicating significance of the forward and backward causal models. In order to make all methods directly comparable to each other, we decided to force them to make causal orientation decisions for all tested causal interactions. This was achieved by comparing scores or p-values of the forward and backward causal models and selecting the most likely orientation. This approach follows the practices of previously published applications of causal orientation methods by their inventors [16–18].

We also note that an alternative approach for the ANM, PNL, and GPI methods is to select a model (forward/backward) that is statistically significant at a given alpha level. The latter approach can sometimes improve accuracy of the causal orientation method while reducing the fraction of causally oriented edges [17]. While results for this approach are not central to this manuscript, we report them in the Additional file 1. We also note that the main findings for the primary approach are for the most part consistent with the alternative approach.

Finally, prior to application of the causal orientation methods, we standardized the data to mean zero and standard deviation one.

### Gold standard construction and observational data

The primary challenges in evaluating causal orientation methods for genomics applications are (i) limited availability of known gold standards of causal molecular interactions and (ii) limited sample sizes of the available observational data. To overcome these challenges we focused on transcription factor-gene regulatory interactions that have been discovered on the genome wide level and experimentally verified in model organisms [20, 21] and more recently in human cell lines [22]. Therefore, the gold standards in this work contain tuples of genes (X, Y) with orientation X→Y, where X is a transcription factor and Y is its target gene.

We used the following four gold standards: (i) interactions of the NOTCH1 transcription factor and its target genes in human T-cell acute lymphoblastic leukaemia (denoted as NOTCH1); (ii) interactions of the RELA transcription factor and its target genes in human T-cell acute lymphoblastic leukaemia (denoted as RELA); (iii) interactions of 140 transcription factors and their target genes in Escherichia coli (denoted as ECOLI); and (iv) interactions of 115 transcription factors and their target genes in Saccharomyces cerevisiae (denoted as YEAST). We used microarray gene expression data from the public domain for orientation of causal relations in each gold standard. The summary statistics of gold standards and corresponding microarray gene expression datasets are given in Table 2 and Table 3, and details of gold standard creation are provided below.

Once each of the gold standards was constructed, we removed interactions without statistically significant associations (in the observational data) according to Fisher's Z-test [23] at 5% FDR [24, 25]. We performed this filtering because presence of association is a necessary condition for detecting causal relations in most practical settings.

All gold standards and microarray gene expression datasets are available for download from http://www.nyuinformatics.org/downloads/supplements/CausalOrientation.

** Creation of NOTCH1 and RELA gold standards**: These gold standards contain genes that are directly downstream of a particular transcription factor (NOTCH1 or RELA) and are functionally regulated by it. The gold standards were obtained using the method described in [22].

Functional gene expression data was first used to identify genes that are downstream (but not necessarily directly) of a particular transcription factor. The samples in such data are randomized to either 'experiment' (e. g., siRNA knockdown of the transcription factor of interest) or 'control' treatment. All genes that are differentially expressed between 'experiment' and 'control' treatments are expected to be downstream of the transcription factor. We have used a t-test with α = 0.05 to identify such genes.

Genome-wide binding data (ChIP-on-chip for NOTCH1 and ChIP-sequencing for RELA) was then employed to identify direct binding targets of each transcription factor. Specifically, for each studied transcription factor we obtained the set of genes with detected promoter region-transcription factor binding according to the primary study that generated binding data.

We note that using genome-wide binding data by itself is insufficient to find downstream functional targets of a transcription factor, because many binding sites can be non-functional [26]. Therefore, the final step in gold standard creation required overlapping the list of direct binding targets (from binding data) with the list of downstream functional targets (from gene expression data). Knowledge gained by integration of data from these two sources is believed to provide high confidence that a given transcription factor directly regulates a particular gene [27]. Also, integration of data from two different sources contributes to the reduction of false positives in the resulting gold standards.

** Creation of gold standard for YEAST and ECOLI:** These gold standards contain genes that bind to and are likely to be regulated by the known transcription factors in Saccharomyces cerevisiae and Escherichia coli.

The Saccharomyces cerevisiae (denoted as YEAST) gold standard was built by identifying the promoter sequences that are both bound by transcription factors according to ChIP-on-chip data at 0.001 alpha level and conserved within 2 related species in the Saccharomyces genus [12, 21]. Binding information is essential because transcription factors must first bind to a gene to induce or suppress expression, while conservation information is important because true-positive transcription factor-DNA interactions are often conserved within a genus.

The Escherichia coli (denoted as ECOLI) gold standard was constructed from RegulonDB (version 6.4), a manually curated database of regulatory interactions obtained mainly through a literature search [12, 20]. ChIP-qPCR data has shown RegulonDB to be approximately 85% complete [28, 29]. Evidence for each regulatory interaction in RegulonDB is classified as "strong" or "weak", depending on the type of experiment used to predict the interaction. For example, binding of a transcription factor to a promoter is considered strong evidence, whereas gene-expression based computational predictions are considered weak evidence. For the purposes of our study, we created a conservative gold standard of only strong interactions.

The gold standards YEAST and ECOLI contain relations of the type "transcription factor → gene" and "transcription factor → transcription factor" (where "gene" refers to a target gene that is not a transcription factor). We decided to simplify the setting of our evaluation when we assess whether the inferred causal orientation X→Y or X←Y is correct, and restricted attention to only interactions of the type "transcription factor → gene". This results in minimizing the number of cases with feedback that can be represented by causal edges in both directions. Note that it is not currently possible to comprehensively apply this filtering step to NOTCH1 and RELA gold standards because the transcription factors are not well known in human cells.

### Performance metrics and statistical significance testing

Two metrics were used to assess performance of causal orientation algorithms. The first metric is accuracy which is the percentage of causal interactions that have been oriented correctly. A method that orients all causal interactions in the gold standard as "transcription factor → gene" would achieve an accuracy of 1; a method that orients all interactions as "gene → transcription factor" would achieve an accuracy of 0; and a method that flips a fair coin to make every orientation decision would on average achieve an accuracy of 0.5.

The second metric is area under ROC curve (AUC), which is known to be more discriminative than accuracy because it takes into account the confidence of orientation decisions [30, 31]. The ROC curve is the plot of sensitivity versus 1-specificity for a range of threshold values on the difference between the scores/p-values of the forward and backward causal models [32]. AUC ranges from 0 to 1, where AUC = 1 corresponds to perfectly correct prediction of causal orientation, AUC = 0.5 corresponds to prediction by chance, and AUC = 0 corresponds to completely incorrect prediction of causal orientation. Computation of sensitivity/specificity and AUC requires a response variable with both positive and negative labels. We created such a response variable by representing gold standard edges (that always point from a transcription factor to its target gene) in the following two equivalent ways: 50% of the edges were represented as "transcription factor → gene" and the other 50% were represented as "gene ← transcription factor". The edges "→" were labeled as positives and "←" were labeled as negatives. This process is illustrated in Table 4; in particular note that the direction of causality always points from transcription factor to gene. AUC was then computed according to the formula given in [33], with the difference in scores/p-values serving as a predictor. Note that the AUC can also be interpreted as the probability that the difference between scores/p-values of the forward and backward causal models for a randomly chosen positive instance is higher than the difference between scores/p-values for a randomly chosen negative instance. Since each of the four gold standards has a large number of edges (>500), the variance in AUC due to different choices of edges for negative and positive labels is minimal and typically smaller than 0.001 AUC, as estimated by computation of AUC for 1,000 random choices of positive and negative labels.

Given values of the above two performance metrics (accuracy and AUC), we need to assess their statistical significance, i.e. find out if the causal orientation is better than by chance. Notice that our gold standards are such that many causal edges are not independent because they share the same transcription factor. That is why we chose to apply an exact statistical testing procedure that can accurately estimate the variance of orientation by chance in our setting [34]. A schematic illustration of the statistical testing procedure is given in Figure 1. First we compute AUC using real gene expression data (Figure 1-a). Then we replace the real data with random data from the Normal distribution with mean 0 and standard deviation 1 (the null distribution) and compute AUC for the *same gold standard* as used with the real data. This step is repeated with 1,000 different randomly generated datasets (Figure 1-b). Finally, we compare AUCs from the null distribution to the AUC obtained in the real data and compute a p-value that corresponds to the proportion of random datasets that yield higher AUCs than the real data (Figure 1-c). A downside of the above statistical testing procedure is that it is computationally expensive and requires running each causal orientation method 5,744,739 times (= 5,739 causal interaction · 1,001 datasets) in order to assess its significance in all 4 gold standards used in our study. To make this analysis feasible, we assessed the statistical significance of only the two best performing methods (IGCI Gaussian/Entropy and IGCI Gaussian/Integral) and utilized the Asclepius Compute Cluster at the Center for Health Informatics and Bioinformatics (CHIBI) at New York University Langone Medical Center (http://www.nyuinformatics.org).

### Methodology for sensitivity analyses

In order to study sensitivity to sample size (number of observations), we sample without replacement from the original gene expression data nested subsets of size 10, 20, 30, ..., N, where N is the number of samples in the dataset. Specifically, the subset of size 10 is included in the subset of size 20, which is in turn included in the subset of size 30, and so on. We then run the causal orientation algorithms on each subset and compute performance. This process is repeated with different sampled nested subsets, and mean performance and variance are estimated over all runs. For the NOTCH1 and RELA gold standards, we used 100 subsets of each size, while for the more computationally expensive YEAST and ECOLI gold standards we used 20 subsets of each size.

For the sensitivity analysis to noise, we add a certain proportion (p) of random Gaussian noise to the gene expression data for both transcription factors and their target genes, run causal orientation methods in the noisy data, report their performance, and repeat the entire process to assess variance (again, 100 times for NOTCH1 and RELA and 20 times for YEAST and ECOLI). Denoting by X the transcription factor and by Y its target gene, the noisy transcription factor X′ and gene Y′ are defined as follows: X′ = (1-p) · X + p · N(M_{X},S_{X}) and Y′ = (1-p) · Y + p · N(M_{Y},S_{Y}), where N(m,s) is a Normally distributed random variable with mean m and standard deviation s, and M_{X}, M_{Y} and S_{X}, S_{Y} are means and standard deviations of X and Y in the original data (prior to noise addition). We use the following proportions of noise (values of p): 0.05, 0.10, 0.15, ..., 0.90, 0.95, 1.00.

### A new ensemble method for causal orientation

As an enhancement to using individual causal orientation techniques, we introduce ensemble causal orientation models that combine decisions of all available individual causal orientation methods in order to produce a more powerful predictor of causal orientation. These methodologies are popular in the field of supervised learning, where non-random weak learners are often combined to produce a more accurate predictor [35]. The use of ensemble modelling for causal orientation is motivated by our empirical observations that there is no single causal orientation methodology that performs perfectly (i.e., with 1.0 accuracy or AUC), many causal orientation methods appear to perform different than chance, and causal orientation methods often make errors in orienting *different* edges.

In this study we experimented with a straightforward approach to ensemble modelling, where we train a logistic regression model [36] on predictions of all 11 tested causal orientation methods (i.e., the input features consist of the differences in scores/p-values between forward and backward models). For the response variable, we follow the same approach as for the computation of AUC which was described above in the subsection on performance metrics. Namely, 50% of edges are represented as "transcription factor → gene" and the other 50% as "gene ← transcription factor". Then the response variable is constructed by labelling "→" edges as positives and "←" edges as negatives.

As with every supervised learning procedure that is trained and tested using the same dataset, it is essential to split the available data into non-overlapping training and testing sets, whereby the training set is used to fit a learning model and the testing set is used to estimate its performance [37, 38]. For each gold standard we used 30% of the causal interactions (chosen at random) for training and the remaining 70% for testing. We decided to use a training set that was smaller than the testing set so that our study resembles a possible practical application where only a small portion of the gold standard is known and the rest is to be discovered. The predictions of the ensemble model in each gold standard are compared with the predictions of the best individual causal orientation technique in the same testing set with 70% of the data (to ensure that the results are directly comparable).

Finally, in addition to exploring holdout validation performance of the ensemble models, we trained and tested the ensemble models on different gold standards. In practice this approach can be justified if the data distributions in the gold standards used for training and testing of ensemble models are similar. It also resembles a practical situation when a gold standard is known in a previously studied dataset but is not known in a new but distributionally similar one.

## Results

### Evaluating causal orientation methods with the accuracy metric

The causal orientation accuracy values are given in Table 5 for 12 causal orientation methods (including orientation by flipping a fair coin which is denoted as "RANDOM" in the table) and 4 gold standards. The performance ranks of methods with accuracies higher than 0.50 are given in Table 6.

As can be seen, IGCI Gaussian/Entropy and IGCI Gaussian/Integral methods achieve the highest accuracies in each of the four gold standards. In general, the other causal orientation methods perform worse, and some methods (e.g., ANM-MML) consistently prefer wrong decisions and have accuracies lower than 0.5.

Interestingly, if we consider the best performing method (IGCI Gaussian/Entropy) with the average rank 1.25, its results are statistically significant at alpha = 0.05 according to the exact test (described in the Methods section) only for the ECOLI gold standard (p-value < 0.001). The second best performing method (IGCI Gaussian/Integral) with the average rank 1.75 achieves significance in two out of four gold standards (p-value < 0.001 for ECOLI and 0.003 for YEAST) at alpha = 0.05. The reason why the IGCI Gaussian/Integral method achieves significance in more gold standards than the best performing technique IGCI Gaussian/Entropy is the small variance of the former method. The detailed statistical significance results including null distributions are given in the Additional file 1 in Figure S1 (for IGCI Gaussian/Entropy) and Figure S3 (for IGCI Gaussian/Integral). It is worth noting that statistical significance was achieved by neither of the two best performing methods in NOTCH1 and RELA gold standards that have only 1 transcription factor. This was primarily due to a large variance of causal orientation accuracies in the null distribution (see Figures S1 and S3 in the Additional file 1). On the other hand, if we join these two gold standards into one with 2 transcription factors, both methods achieve statistical significance at alpha = 0.05 (p-value of IGCI Gaussian/Entropy is 0.018 and p-value of IGCI Gaussian/Integral is 0.007).

The superior and often statistically significant performance of the two IGCI methods compared to other techniques was a surprising finding that we did not expect theoretically. IGCI assumes a noise free model (Table 1) that is unrealistic in genomics data. On the other hand, other methods that have *a priori* more realistic assumptions perform worse. We hypothesize that sufficient assumptions of the IGCI methods are too strict in practice and can be mitigated in many ways that are currently not well understood.

### Evaluating causal orientation methods with the AUC metric

The causal orientation AUC values are given in Table 7 for 12 causal orientation methods (including orientation by flipping a fair coin which is denoted as "RANDOM" in the table) and 4 gold standards. The performance ranks of methods with AUCs higher than 0.50 are given in Table 8.

Similarly to the accuracy results, IGCI Gaussian/Entropy and IGCI Gaussian/Integral methods achieve the highest AUCs in each of the four gold standards. Other causal orientation methods perform worse, and some methods (e.g., ANM-MML) consistently prefer wrong decisions and have AUCs lower than 0.5.

The statistical significance analysis of IGCI Gaussian/Entropy and IGCI Gaussian/Integral is described in detail in the Additional file 1 in Figure S2 (for IGCI Gaussian/Entropy) and Figure S4 (for IGCI Gaussian/Integral). In summary, both methods achieve statistical significance of causal orientations (at alpha = 0.05) in ECOLI and YEAST, only IGCI Gaussian/Integral achieves significance in RELA, and none of the two methods achieves significance in NOTCH1. On the other hand, similarly to results for the accuracy metric, both methods achieve statistically significant results in the joined NOTCH1 and RELA gold standard with 2 transcription factors.

### Sensitivity analysis to noise

The results of sensitivity analysis to noise for the two best performing methods (IGCI Gaussian/Entropy and IGCI Gaussian/Integral) are given in Figures 2, 3, 4, 5 for NOTCH1, RELA, ECOLI, and YEAST gold standards, respectively. In all gold standards except for YEAST, the accuracy of the methods decreases with increasing noise proportion. On the other hand, in YEAST gold standard the performance of causal orientation methods significantly *increases* when a small amount of noise is added, and then gradually decreases for higher proportions of noise. The Additional file 1 provides a detailed analysis of this phenomenon.

Whereas in NOTCH1 and RELA gold standards it takes only 5-10% of noise to make the results statistically indistinguishable from orientation by chance, in YEAST and ECOLI gold standards the methods can tolerate much higher proportions of noise and still produce statistically significant results. This can be attributed to a larger number of transcription factors in YEAST and ECOLI gold standards, as well as larger sample sizes in the corresponding datasets which both decrease the variability of the results.

A decrease in performance upon the addition of noise is theoretically expected since IGCI assumes a noise-free model, and the addition of Gaussian noise violates its sufficient assumptions. Also, as can be seen in the figures, the IGCI Gaussian/Integral method has lower variance than the IGCI Gaussian/Entropy method. The above results are consistent with our prior findings and statistical significance testing by the exact test (see Figures S1-S4 in the Additional file 1).

### Sensitivity analysis to sample size

The results of sensitivity analysis to sample size for the two best performing methods (IGCI Gaussian/Entropy and IGCI Gaussian/Integral) are given in Figures 6, 7, 8, 9 for NOTCH1, RELA, ECOLI, and YEAST gold standards, respectively. In all gold standards except for YEAST, the accuracy of the methods decreases as the sample size gets smaller. On the other hand, in YEAST gold standard the performance of causal orientation methods slightly increases by reducing the sample size and then gradually decreases for smaller sample sizes. The Additional file 1 provides a detailed analysis of this phenomenon.

Whereas in NOTCH1 and RELA gold standards results become statistically indistinguishable from orientation by chance when the sample size is <80-100, in YEAST and ECOLI gold standards the methods yield statistically significant results for smaller sample sizes. This can be attributed to a larger number of transcription factors in YEAST and ECOLI gold standards which decreases variability of the results.

Also, as can be seen in the figures, the IGCI Gaussian/Integral method has lower variance than the IGCI Gaussian/Entropy method. The above results are consistent with our prior findings in statistical significance testing by the exact test (see Figures S1-S4 in the Additional file 1).

### Ensemble causal orientation

For each gold standard, Table 9 compares the AUC achieved by the best individual causal orientation method to the AUC achieved by the ensemble method, which combines the predictions of all 11 methods using a logistic regression model. A detailed description of the ensemble modeling methodology is given in the Methods section. As can be seen, ensemble causal orientation achieves higher values of AUC than any individual causal orientation method in all four gold standards. It is worthwhile to highlight the magnitude of the improvement in the YEAST gold standard: the ensemble approach improves performance over the best individual causal orientation method (IGCI Gaussian/Integral) by 0.164 AUC. Table 10 provides coefficients for the ensemble logistic regression model in the YEAST gold standard. Bold values correspond to coefficients that are statistically significant at 0.05 alpha level. The model preserves its performance (0.822 AUC) when it is trained/tested using only the 5 causal orientation methods that have statistically significant coefficients. Therefore the improvement in AUC in the YEAST gold standard can be attributed to effectively combining the IGCI and ANM-MML causal orientation predictions by the logistic regression ensemble model.

The above results were obtained by holdout validation where we used different portions of the same gold standard for training and testing ensemble models. We also experimented with training and testing ensemble models on different gold standards. First we experimented with the RELA and NOTCH1 gold standards that were derived from the same organism and phenotype, and thus are likely to be distributionally similar and support cross-gold standard application of the ensemble model. We find that an ensemble logistic regression model trained on RELA obtains AUC = 0.996 when tested on NOTCH1, and likewise an ensemble model trained on NOTCH1 obtains AUC = 0.989 when tested on RELA. Both these results significantly improve performance over the best individual causal orientation method (IGCI Gaussian/Entropy) in both NOTCH1 and RELA gold standards (with p-values <0.0001).

In addition, we experimented with the YEAST and ECOLI gold standards which originate from different organisms and thus are unlikely to be distributionally similar; for this reason they *a priori* question cross-gold standard application of the ensemble model. Indeed, our results confirm this expectation: an ensemble logistic regression model trained on YEAST performs with AUC = 0.4833 in ECOLI, and an ensemble model trained in ECOLI performs with AUC = 0.5916 in YEAST. Neither of these results improves the best individual causal orientation method in the respective gold standard. This suggests that the success of cross-gold standard application of ensemble models is grounded on similarity of the underlying distributions.

## Discussion

This work represents the first comprehensive effort to evaluate performance and furthermore enhance the recently introduced causal orientation methods [13–18] in genomics data. One of the main challenges is the limited availability of gold standards of causal molecular interactions. That is why we have focused on regulatory interactions between transcriptions factors and their target genes that have been recently identified on a genome-wide level in model organisms and human cell lines. These interactions have a well-defined causal directionality (from a transcription factor to its target gene) and can be readily used for an evaluation study such as ours. However, it is possible that some edges in the gold standards (especially, NOTCH1 and RELA) have causal relationships in both directions due to feedback mechanisms. Since the signal in the direction from a transcription factor to its target gene is expected to be stronger than in the opposite direction (due to attenuation in the signal transduction pathways), we are implicitly assuming that in such cases a causal orientation method would prefer the direction from transcription factor to gene. If this assumption is not true, this does not invalidate results of the methods (because the direction from transcription factor to gene is valid) but provides additional explanations as to why some methods prefer the opposite causal direction.

Even though the choice of the gold standard with transcription factor-gene regulatory interactions enables this study, its practical relevance may be limited in the organisms/settings where all transcription factors have already been identified. That is why we plan to work on extending this evaluation to other types of causal molecular interactions, for example in cellular protein signaling networks [39].

In this study we have implicitly assumed that unoriented edges (representing causal interactions between a transcription factor and its target gene without specifying which of the two genes is a transcription factor and which is its regulatory target) are given by an Oracle and we have evaluated performance of only causal orientation methods. However, in practical tasks one typically has to both discover and orient edges. Although we have previously evaluated methods for discovery of unoriented edges [12], it will be interesting to assess the performance of the two classes of methods (for edge discovery and for its orientation) when they are applied together.

Finally, we think that a fruitful area of research will be to extend this study by comparison with classical causal orientation techniques that output Markov equivalence classes of graphs (based on v-structures with constraint propagation) and thus, in general, can orient only a subset of edges in the graph [1].

## Conclusions

In this paper we have taken a first step toward practical use of recent causal orientation techniques in the genomics domain. First of all, we report results of an extensive study of causal orientation methods in genomics data that utilized 12 methods/variants to distinguish cause (transcription factor) from effect (target gene) in 5,739 causal interactions. We have found that IGCI Gaussian methods [16, 17] often accurately infer directionality of the causal interaction, and they outperform other causal orientation techniques. In addition, we have performed sensitivity analyses that allow us to empirically establish the minimal requirements for the sample size and maximal level of noise that can be tolerated by the best performing causal orientation techniques. Second, we described a novel ensemble technique for causal orientation that combines decisions of individual causal orientation methods to provide a more powerful predictor of causal directionality. The proposed ensemble method was found to be more accurate than any best individual causal orientation method in the tested data. In summary, our results suggest that causal orientation methods have significant potential to facilitate reconstruction of molecular pathways by minimizing the number of required randomized experiments to find causal directionality and by avoiding experiments that are infeasible and/or unethical.

## References

Spirtes P, Glymour CN, Scheines R: Causation, prediction, and search. 2000, Cambridge, Mass: MIT Press, 2:

Pearl J: Causality: models, reasoning, and inference. 2009, Cambridge, U.K: Cambridge University Press, 2:

Glymour CN, Cooper GF: Computation, causation, and discovery. 1999, Menlo Park, Calif: AAAI Press

Neapolitan RE: Learning Bayesian networks. 2004, Upper Saddle River, NJ: Pearson Prentice Hall

Pearl J: Probabilistic reasoning in intelligent systems: networks of plausible inference. 1988, San Mateo, California: Morgan Kaufmann Publishers

Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos XD: Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification. Part I: Algorithms and Empirical Evaluation. Journal of Machine Learning Research. 2010, 11: 171-234.

Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos XD: Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification. Part II: Analysis and Extensions. Journal of Machine Learning Research. 2010, 11: 235-284.

Granger CWJ: Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society. 1969, 424-438.

The Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel 2003. [http://www.nobelprize.org/nobel_prizes/economics/laureates/2003/]

Sims CA: Money, income, and causality. The American Economic Review. 1972, 62 (4): 540-552.

The Prize in Economic Sciences 2011. [http://www.nobelprize.org/nobel_prizes/economics/laureates/2011/]

Narendra V, Lytkin NI, Aliferis CF, Statnikov A: A comprehensive assessment of methods for de-novo reverse-engineering of genome-scale regulatory networks. Genomics. 2011, 97 (1): 7-18. 10.1016/j.ygeno.2010.10.003.

Shimizu S, Hoyer PO, Hyvarinen A, Kerminen A: A Linear Non-Gaussian Acyclic Model for Causal Discovery. The Journal of Machine Learning Research. 2006, 7: 2003-2030.

Hoyer PO, Janzing D, Mooij J, Peters J, Scholkopf B: Nonlinear causal discovery with additive noise models. Advances in Neural Information Processing Systems. 2009, 21: 689-696.

Zhang K, Hyvärinen A: Distinguishing causes from effects using nonlinear acyclic causal models. Journal of Machine Learning Research, Workshop and Conference Proceedings (NIPS 2008 causality workshop). 2008, 6: 157-164.

Daniusis P, Janzing D, Mooij J, Zscheischler J, Steudel B, Zhang K, Schölkopf B: Inferring deterministic causal relations. Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI-2010). 2010, 143-150.

Janzing D, Mooij J, Zhang K, Lemeire J, Zscheischler J, Daniusis P, Steudel B, Scholkopf B: Information-geometric approach to inferring causal directions. Artificial Intelligence. 2012,

Mooij JM, Stegle O, Janzing D, Zhang K, Schölkopf B: Probabilistic latent variable models for distinguishing between cause and effect. Advances in Neural Information Processing Systems. 2010, 1-23

Tsamardinos I, Brown LE, Aliferis CF: The Max-Min Hill-Climbing Bayesian Network Structure Learning Algorithm. Machine Learning. 2006, 65 (1): 31-78. 10.1007/s10994-006-6889-7.

Gama-Castro S, Jimenez-Jacinto V, Peralta-Gil M, Santos-Zavaleta A, Penaloza-Spinola MI, Contreras-Moreira B, Segura-Salazar J, Muniz-Rascado L, Martinez-Flores I, Salgado H, et al: RegulonDB (version 6.0): gene regulation model of Escherichia coli K-12 beyond transcription, active (experimental) annotated promoters and Textpresso navigation. Nucleic Acids Res. 2008, 36 (Database issue): D120-D124.

MacIsaac KD, Wang T, Gordon DB, Gifford DK, Stormo GD, Fraenkel E: An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMCBioinformatics. 2006, 7: 113-

Shmelkov E, Tang Z, Aifantis I, Statnikov A: Assessing quality and completeness of human transcriptional regulatory pathways on a genome-wide scale. BiolDirect. 2011, 6: 15-

Anderson TW: An introduction to multivariate statistical analysis. 2003, Hoboken, N.J: Wiley-Interscience, 3:

Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical SocietySeries B (Methodological). 1995, 57 (1): 289-300.

Benjamini Y, Yekutieli D: The control of the false discovery rate in multiple testing under dependency. AnnStatist. 2001, 29 (4): 1165-1188.

Li XY, MacArthur S, Bourgon R, Nix D, Pollard DA, Iyer VN, Hechmer A, Simirenko L, Stapleton M, Luengo Hendriks CL, et al: Transcription factors bind thousands of active and inactive regions in the Drosophila blastoderm. PLoS Biol. 2008, 6 (2): e27-10.1371/journal.pbio.0060027.

Kirmizis A, Farnham PJ: Genomic approaches that aid in the identification of transcription factor target genes. Experimental biology and medicine. 2004, 229 (8): 705-721.

Stolovitzky G, Prill RJ, Califano A: Lessons from the DREAM2 Challenges. AnnNYAcadSci. 2009, 1158: 159-195.

Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, Kasif S, Collins JJ, Gardner TS: Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 2007, 5 (1): e8-10.1371/journal.pbio.0050008.

Ling CX, Huang J, Zhang H: AUC: a statistically consistent and more discriminating measure than accuracy. Proceedings of the Eighteenth International Joint Conference of Artificial Intelligence (IJCAI). 2003

Ling CX, Huang J, Zhang H: AUC: a better measure than accuracy in comparing learning algorithms. Proceedings of the Sixteenth Canadian Conference on AI. 2003

Fawcett T: ROC Graphs: Notes and Practical Considerations for Researchers. Technical Report, HPL-2003-4, HP Laboratories. 2003

Hand DJ, Till RJ: A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning. 2001, 45 (2): 171-186. 10.1023/A:1010920819831.

Good PI: Permutation tests: a practical guide to resampling methods for testing hypotheses. 1994, New York: Springer-Verlag

Dietterich TG: Ensemble methods in machine learning. Proceedings of the First International Workshop on Multiple Classifier Systems. 2000, 1-15.

McCullagh P, Nelder JA: Generalized linear models. 1989, London: Chapman and Hall, 2:

Weiss SM, Kulikowski CA: Computer systems that learn: classification and prediction methods from statistics, neural nets, machine learning, and expert systems. 1991, San Mateo, Calif: M. Kaufmann Publishers

Hastie T, Tibshirani R, Friedman JH: The elements of statistical learning: data mining, inference, and prediction. 2001, New York: Springer

Sachs K, Perez O, Pe'er D, Lauffenburger DA, Nolan GP: Causal protein-signaling networks derived from multiparameter single-cell data. Science. 2005, 308 (5721): 523-529. 10.1126/science.1105809.

Margolin AA, Palomero T, Sumazin P, Califano A, Ferrando AA, Stolovitzky G: ChIP-on-chip significance analysis reveals large-scale binding and regulation by human transcription factor oncogenes. ProcNatlAcadSciUSA. 2009, 106 (1): 244-249.

Espinosa L, Cathelin S, D'Altri T, Trimarchi T, Statnikov A, Guiu J, Rodilla V, Ingles-Esteve J, Nomdedeu J, Bellosillo B, et al: The Notch/Hes1 pathway sustains NF-kappaB activation through CYLD repression in T cell leukemia. Cancer Cell. 2010, 18 (3): 268-281. 10.1016/j.ccr.2010.08.006.

Kasowski M, Grubert F, Heffelfinger C, Hariharan M, Asabere A, Waszak SM, Habegger L, Rozowsky J, Shi M, Urban AE, et al: Variation in transcription factor binding among humans. Science. 2010, 328 (5975): 232-235. 10.1126/science.1183621.

Faith JJ, Driscoll ME, Fusaro VA, Cosgrove EJ, Hayete B, Juhn FS, Schneider SJ, Gardner TS: Many Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured experimental metadata. Nucleic Acids Res. 2008, 36 (Database issue): D866-D870.

Kohlmann A, Kipps TJ, Rassenti LZ, Downing JR, Shurtleff SA, Mills KI, Gilkes AF, Hofmann WK, Basso G, Dell'orto MC, et al: An international standardization programme towards the application of gene expression profiling in routine leukaemia diagnostics: the Microarray Innovations in LEukemia study prephase. British journal of haematology. 2008, 142 (5): 802-807. 10.1111/j.1365-2141.2008.07261.x.

DeLong ER, DeLong DM, Clarke-Pearson DL: Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988, 44 (3): 837-845. 10.2307/2531595.

## Acknowledgements

The authors would like to acknowledge Dominik Janzing and Joris M. Mooij, who contributed to the development of the majority of causal orientation methods used in this study, and thank them for providing (i) software implementations of causal orientation algorithms, (ii) help with stating assumptions of the tested methods, (iii) ideas about statistical significance testing approach, and (iv) feedback on other aspects of the manuscript and, in particular, interpretation of the results. The authors are also grateful to Efstratios Efstathiadis and Eric Peskin for the help with providing access and running experiments on the high performance computing facility. Finally, the authors would like to thank Ioannis Aifantis for providing experimental data for NOTCH1 that was used for the development of the corresponding gold standard. The empirical evaluation was supported in part by the grants 1R01LM011179-01A1 from the National Library of Medicine and 1UL1RR029893 from the National Center for Research Resources, National Institutes of Health.

This article has been published as part of *BMC Genomics* Volume 13 Supplement 8, 2012: Proceedings of The International Conference on Intelligent Biology and Medicine (ICIBM): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S8.

## Author information

### Authors and Affiliations

### Corresponding author

## Additional information

### Competing interests

The authors declare that they have no competing interests.

### Authors' contributions

AS conceived the study. All authors participated in the design of the study. MH and NIL performed all computational experiments. AS and MH drafted the manuscript. All authors edited, read, and approved the final manuscript.

## Electronic supplementary material

### 12864_2012_4463_MOESM1_ESM.pdf

Additional file 1: This file contains (1) brief description of causal orientation algorithms; (2) results of causal orientation methods ANM, PNL, and GPI obtained by assessing statistical significance of the forward and backward causal models; (3) detailed results of significance testing of IGCI Gaussian/Entropy and Gaussian/Integral methods; (4) explanation of performance increase due to adding small amount of noise or reducing the sample size in YEAST gold standard. (PDF 7 MB)

## Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

### Cite this article

Statnikov, A., Henaff, M., Lytkin, N.I. *et al.* New methods for separating causes from effects in genomics data.
*BMC Genomics* **13**
(Suppl 8), S22 (2012). https://doi.org/10.1186/1471-2164-13-S8-S22

Published:

DOI: https://doi.org/10.1186/1471-2164-13-S8-S22