A powerful nonparametric method for detecting differentially coexpressed genes: distance correlation screening and edgecount test
 151 Downloads
Abstract
Background
Differential coexpression analysis, as a complement of differential expression analysis, offers significant insights into the changes in molecular mechanism of different phenotypes. A prevailing approach to detecting differentially coexpressed genes is to compare Pearson’s correlation coefficients in two phenotypes. However, due to the limitations of Pearson’s correlation measure, this approach lacks the power to detect nonlinear changes in gene coexpression which is common in gene regulatory networks.
Results
In this work, a new nonparametric procedure is proposed to search differentially coexpressed gene pairs in different phenotypes from largescale data. Our computational pipeline consisted of two main steps, a screening step and a testing step. The screening step is to reduce the search space by filtering out all the independent gene pairs using distance correlation measure. In the testing step, we compare the gene coexpression patterns in different phenotypes by a recently developed edgecount test. Both steps are distributionfree and targeting nonlinear relations. We illustrate the promise of the new approach by analyzing the Cancer Genome Atlas data and the METABRIC data for breast cancer subtypes.
Conclusions
Compared with some existing methods, the new method is more powerful in detecting nonlinear type of differential coexpressions. The distance correlation screening can greatly improve computational efficiency, facilitating its application to large data sets.
Keywords
Distance correlation Edgecount test Differential coexpression Breast cancer subtypes Pathway analysis The cancer genome atlasAbbreviations
 BH
BenjaminiHochberg
 DC
Distance correlation
 DCE
Differential coexpression or differentially coexpressed
 DE
Differential expression or differentially expressed
 MST
Minimum spanning tree
 TCGA
The cancer genome atlas
Background
The vast majority of human diseases are complex diseases, in the sense that they are not the consequence of an abnormality of a single gene, but a result of changes in many genes. Thanks to the rapid advance of highthroughput technologies, researchers nowadays can investigate the association between a disease and tens of thousands of genes simultaneously. Two types of analysis, namely differential expression (DE) analysis and differential coexpression (DCE) analysis, have been extensively applied in genetic association studies [1, 2, 3, 4]. Differential expression analysis targets genes with differential expression levels in different phenotypes, while DCE analysis detects gene pairs or gene sets that are differentially associated or regulated in different groups. Over the past years, there have been considerable tools developed for DE analysis and other similar analyses such as differential methylation (DM) analysis. One can refer to Soneson and Delorenzi (2013) [5] for a comprehensive review and comparison of several most popular tools including edgeR, DESeq, TSPM, baySeq, EBSeq and ShrinkSeq. Despite the success of DE analysis, the progress on DCE analysis is relatively slow partially due to the combinatorial nature of the problem and the lack of powerful statistical test for comparing multidimensional patterns.
It should be noted that the test proposed here does not rely on any parametric assumption but generally targets all types of DCE. One can explicitly test H_{0} with a recently developed edgecount test [9]. However, unlike the Pearson’s correlation method, the new test requires several intermediate steps including the calculation of minimum spanning trees, therefore it could be less efficient when applied to largescale data. To overcome this difficulty, we use the distance correlation measure to screen out noncoexpressed (independent) gene pairs before the edgecount test, so that the search space can be greatly reduced. The distance correlation measure has appealing theoretical properties and can generally capture nonlinear associations. On the whole, we put forward a complete framework for DCE analysis which is effective and applicable to largescale expression data.
The rest of the paper is structured as follows: Section “Methods” reviews the technical details of distance correlation screening and edgecount test. Simulation studies are performed to compare the edgecount test with two existing approaches based on Pearson’s correlation and mutual information. In Section “Results”, we apply this new approach to the Cancer Genome Atlas (TCGA) data as well as the METABRIC data for the DCE analysis between four subtypes of breast cancer. We discuss the strengths and some possible extensions of the new approach in Section“Discussion” and conclude this paper in Section“Conclusions”.
Methods
Distance correlation screening
where d_{ x } and d_{ y } are the dimensions of X and Y, \(c_{d_{x}}=\frac {\pi ^{(1+d_{x})/2}}{\Gamma \{(1+d_{x})/2\}}\) and \(c_{d_{y}}=\frac {\pi ^{(1+d_{y})/2}}{\Gamma \{(1+d_{y})/2\}}\). Unless otherwise specified, \(\pmb {z}_{d_{z}}\) denotes the Euclidean norm of \(\pmb {z}\in \mathbb {R}^{d_{z}}\), and \(\phi ^{2} = \phi \bar {\phi }\) for a complexvalued function ϕ and its conjugate \(\bar {\phi }\).

Setting 1: X_{ i }∼N(0,1), Y_{ i }∼N(0,2), i=1,2,…,50,

Setting 2: X_{ i }∼Uniform(0,1), Y_{ i }∼Uniform(0,2), i=1,2,…,50.
The distance correlation measure has been applied in previous genomic studies to quantify gene coexpressions [15]. Besides DC, there are several measures that can pick up nonlinear dependence between variables, although each of them has its own practical limitations. Clark (2013) [16] empirically compared six popular measures including Pearson’s correlation, Spearman’s correlation, distance correlation, mutual information (MI), maximum information coefficient (MIC) and Hoeffding’s D under a variety of different settings, and it was found that the six methods perform almost equally well in detecting the linear correlation. However, under the nonlinear dependence, the distance correlation and MIC performed notably better than the other measures. There are two considerations that lead to the choice of DC instead of MIC in our analysis. First, DC is straightforward to calculate and not an approximation while MIC relies on a userdefined number of grids for approximation. Second, as pointed out in some recent studies [17, 18], the DC exhibits more statistical power than MIC under moderate or small sample sizes.
Edgecount test
Our testing step is to compare two multivariate distributions (dimension is 2 in DCE analysis). In statistics literature, there are mainly two types of multivariate tests, namely the multidimensional KolmogorovSmirnov (KS) test [19] and edgecount test [20, 21]. These two methods, however, both have practical limitations when applied to real data. For instance, KS test is very conservative, i.e., the null hypothesis is too often not rejected. Also, by the brute force algorithm, the application of multidimensional KS test can be prohibitively computationally intensive. The edgecount test is easy to implement but it is known to be problematic under the location and scale alternatives. Recently, Chen and Friedman [9] developed a modified version of edgecount test, which works properly under different alternatives and exhibits substantial power gains over existing edgecount tests. Similar as other edgecount tests, the new test is based upon a similarity graph such as minimum spanning tree (MST, [22]) that is constructed over the pooled samples from different groups. Generally, if two groups have different distributions, samples would be preferentially closer to others from the same group than those from the other group, therefore the edges in the MST would be more likely to connect samples from the same group. The test rejects the null if the number of betweengroup edges is significantly less than expected.
where \(C=\frac {1}{2}{\sum \nolimits }_{i=1}^{N}G_{i}^{2}G\), and G_{ k } stands for the subgraph in G that includes all edges that connect to node k. It was proved that under the permutation null hypothesis, S asymptotically follows a Chisquare distribution with 2 degrees of freedom [9]. The pvalue approximation generally works well under relatively small sample size, for instance, when min(n,m)=20. In their work, Chen and Friedman also suggested that the use of kMST graphs (e.g., 3MST or 5MST) may lead to a better approximation of pvalue in practice.
then the following theorem can be derived:
Theorem 1
where i=1,…,p is the group index.
where \(N={\sum \nolimits }_{k=1}^{p}n_{k}\) and \(C=\frac {1}{2}{\sum \nolimits }_{i=1}^{N}G_{i}^{2}G\). The detailed proof for Theorem 1 can be found in the Appendix of Zhang et al. (2017) [23].
Simulation study: edgecount test versus two existing approaches

Linear setting: \((X_{1}, Y_{1})^{T}\sim \mathrm {N}\left [\left (\begin {array}{l} 0\\ 0 \end {array}\right), \left (\begin {array}{ll} 1 & \rho \\ \rho & 1 \end {array}\right)\right ]\), \((X_{2}, Y_{2})^{T}\sim \mathrm {N}\left [\left (\begin {array}{l} 0\\ 0 \end {array}\right), \left (\begin {array}{cc} 1 & \rho +\Delta \\ \rho +\Delta & 1 \end {array}\right)\right ]\), where ρ=0.3, Δ∈{0.1,0.2,…,0.6}.

Nonlinear setting: X_{ i }∼Uniform(−2,2), \(Y_{i}=X_{i}^{2}+\epsilon _{i}\), \(\epsilon _{i}\sim \mathrm {N}\left (0, \sigma _{i}^{2}\right)\), i=1,2, σ_{1}=0.5, σ_{2}=σ_{1}+Δ, Δ∈{0.1,0.2,…,0.6}.
Our simulation study demonstrated the capability of our edgecount test in capturing both linear and nonlinear changes. Generally, the edgecount test performs similarly well as Pearson’s correlation and mutual information under linear setting but achieves significantly better sensitivity for nonlinear setting.
Results
In this section, we applied the twostep pipeline to search DCE genes in four subtypes of breast cancer using the Cancer Genome Atlas (TCGA) data. Four gene sets, including two KEGG gene pathways and two MSigDB hallmark gene sets, were used as illustrative examples. We validated our findings by the largescale METABRIC breast cancer data.
Data preparation
The remaining 959 samples were further classified into five subtypes according to two molecular signatures, namely PAM50 [29] and SCMOD2 [30]. The two classifications were implemented separately using R package genefu [31] and we obtained 530 subjects with concordant classification by two classifiers. The resulting set contains 221 subjects in luminal A group, 119 in luminal B group, 74 in her2enriched group, 105 in basallike group and 11 in normallike group. The normallike group was excluded from the analysis due to the low sample size and only four subtype groups were considered.
Finally, we perform a quantile normalization [32] for each group separately, so that the marginal distributions of all the genes match across groups. The purpose of quantile normalization is to avoid the rejection of H_{0} due to marginal difference (differential expression) instead of different dependency patterns (differential coexpression).
Some illustrative examples
We illustrated the new method using four molecular pathways, including the cell cycle and ERBB pathways from KEGG database, as well as the JAKSTAT and TGFbeta signaling pathways from MSigDB database. All the selected pathways play critical roles in the initiation and progression of many human cancers. For instance, KEGG cell cycle pathway contains 128 genes that coregulate cell proliferation, including ATM, RB1, CCNE1 and MYC. Abnormal regulation among these genes may cause the over proliferation of cells and an accumulation of tumor cell numbers. The ERBB pathway in KEGG database consisted of 87 genes including important protooncogenes and tumor suppressors such as PIK3C, KRAS and STAT5. It is known that ERBB pathway is closely related to the development of a wide variety of types of tumor. Especially, the excessive signaling of growth factor receptors ERBB1 and ERBB2 are critical factors in the malignancy of solid tumor [3]. The JAKSTAT signaling pathway and TGFbeta signaling pathway were also known to play critical roles in tumor suppression and cancer metastasis. For instance, TGFbeta can modulate processes such as cell invasion, immune regulation, and microenvironment modification that cancer cells may exploit to their advantage [33].
For each subtype group, we first computed the distance correlation matrix and corresponding pvalue matrix for all gene pairs (see Methods section for details). A BenjaminiHochberg (BH, [34]) procedure with FDR≤0.05 was then applied to screen out uncorrelated genes. A gene pair was deemed as uncorrelated if the adjusted pvalues in four subtypes are all above 0.05. This screening resulted in a total of 487 correlated gene pairs in cell cycle pathway, 359 in ERBB pathway, 592 in JAKSTAT signaling pathway and 440 in TGFbeta signaling pathway. These four reduced sets of gene pairs were used as the search space for the testing step.
Comparison with Pearson’s correlation method
Validation by METABRIC data
Discussion
In this article, we developed a nonparametric method to effectively identify variability in gene coexpression pattern among multiple phenotypes. Our work presents novelty in two aspects. Firstly, we dropped the assumption of joint normality between genes and directly test if a gene pair follow the same joint distribution over different phenotypes. By a graphbased approach, the comparison between multivariate distributions was transformed to an edgecount test which is easy to implement. The statistical test used in this study is fully nonparametric and it rejects null hypothesis under different types of differential coexpressions including linear and nonlinear types. By a real life application, we demonstrated how the proposed test is better able to capture the DCE genes as compared to the Pearson’s correlation method.
Second, to make the test applicable to largescale data, we employed a distance correlation measure to filter out all the noncoexpressed gene pairs prior to the testing step. One shortcoming of the edgecount test is that it requires the calculation of a similarity graph that connects all the samples. For example, in our analysis of the breast cancer data, a 3MST (union of three nonoverlapping MSTs) was used as the similarity graph. Under large number of genes, this step can be computationally expensive. As a well accepted fact in biology, most gene coexpression networks are overall sparse, although they might be locally dense, hence the coexpression screening step should considerately reduce the search space. In the example of KEGG cell cycle pathway, the search space was reduced from more than 8000 gene pairs to less than 500.
Throughout this paper, we have focused on the study of coexpression between two genes. Nevertheless, it is noteworthy that the proposed test can be readily applied to multiplegene cases. In fact, Chen and Friedman’s test, as well as the multigroup extension, is merely built upon a similarity graph connecting all the samples, and the construction of graph depends only on the interpoint distances regardless of the dimension [9]. In practice, one can simply use Euclidean norm as the interpoint distance and construct the similarity graph such as MST or kMST. Additionally, because of the flexibility of our approach, one can also explicitly test for the difference in a higherorder interaction such as threeway gene coexpression, by properly controlling all the marginals and lowerorder interactions.
Conclusions
Differential coexpression analysis is critical for the identification of diseaserelated factors. Motivated by the fact that nonlinear coexpressions generally exist in cellular regulations, we develop a new nonparametric method for DCE analysis, which measures and compares gene coexpressions in linear and nonlinear aspects. Our method does not rely on any assumption regarding the probability distributions of the genes being studied, but it generally tests the equality of two or multiple coexpression patterns through a powerful graphbased test. For practical consideration, we suggest a screening step based on distance correlation to tackle the computational burden for largescale data. The proposed computational procedure can also be applied to other similar bioinformatics problems such as the differential comethylation analysis [36, 37] and differential gene set analysis [38, 39].
Notes
Funding
Support has been provided in part by the Arkansas Biosciences Institute, the major research component of the Arkansas Tobacco Settlement Proceeds Act of 2000.
Availability of data and materials
The TCGA data for breast cancer can be downloaded from Genomic Data Commons (https://gdc.cancer.gov). The METABRIC data (normalized gene expression data) can be downloaded from http://www.cbioportal.org/study?id=brca_metabric#summary. The KEGG pathways and MSigDB hallmark gene sets can be found at Gene Set Enrichment Analysis (http://software.broadinstitute.org/gsea/index.jsp).
Authors’ contributions
The author read and approved the final manuscript.
Ethics approval and consent to participate
Not applicable
Competing interests
The author has declared that no competing interests exist.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
 1.Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2011; 9(559).Google Scholar
 2.Yu H, Liu BH, Ye ZQ, Li C, Li YX, et al.Linkbased quantitative methods to identify differentially coexpressed genes and gene pairs. BMC Bioinformatics. 2011; 12(316).Google Scholar
 3.The Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature. 2012; 490:61–70.CrossRefPubMedCentralGoogle Scholar
 4.Wang L, Mo Q, Wang J. MIrExpress: A databased for gene expression correlation in immune cells based on mutual information and Pearson correlation. J Immunol. 2015; 2015(140819).Google Scholar
 5.Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNAseq data. BMC Bioinformatics. 2013; 14(91).Google Scholar
 6.Li J, Li YX, Li YY. Differential Regulatory Analysis Based on Coexpression Network in Cancer Research. BioMed Res Int. 2016; 2016(4241293):1–8.Google Scholar
 7.McKenzie AT, Katsyv I, Song W, Wang M. DGCA: A comprehensive R package for Differential Gene Correlation Analysis. BMC Syst Biol. 2016; 10(106), Zhang, B.Google Scholar
 8.Santo S, Galatro T, Watanabe R, ObaShinjo S, Nagahashi M, et al.GoGA: An R package to identify differentially coexpressed gene sets by analyzing the graph spectra. PLoS One. 2015; 10(8).Google Scholar
 9.Chen H, Friedman JH. A new graphbased twosample test for multivariate and object data. J Am Stat Assoc. 2017; 112:397–409.CrossRefGoogle Scholar
 10.Li R, Zhong W, Zhu L. Feature screening via distance correlation learning. J Am Stat Assoc. 2012; 107:1129–39.CrossRefPubMedPubMedCentralGoogle Scholar
 11.Szekely GJ, Rizzo ML, Bakirov NK. Measuring and testing dependence by correlation distances. Ann Stat. 2007; 35:2769–94.CrossRefGoogle Scholar
 12.Szekely GJ, Rizzo ML. Brownian distance covariance. Ann Appl Stat. 2009; 3:1233–303.CrossRefGoogle Scholar
 13.Szekely GJ, Rizzo ML. The distance correlation ttest of independence in high dimension. J Multivar Anal. 2013; 117:193–213.CrossRefGoogle Scholar
 14.Rizzo ML, Szekely GJ. 2017. R package energy: https://cran.rproject.org/web/packages/energy/energy.pdf.
 15.Albukhaiti H, Luo J. Using differential nonlinear gene coexpression network analysis for identification gastric cancer related genes. Biomed Res. 2017; 28(18):8014–7.Google Scholar
 16.Clark M. A comparison of correlation measures. Technical report of University of Notre Dame. 2013.Google Scholar
 17.Kinney J, Atwal G. Equitability, mutual information, and the maximal information coefficient. Proc Natl Acad Sci U S A. 2014; 111(9):3354–9.CrossRefPubMedPubMedCentralGoogle Scholar
 18.Simon N, Tibshirani S. Comment on “Detecting novel associations in large data sets”. 2013. http://statweb.stanford.edu/tibs/reshef/comment.pdf.
 19.Lopes RHC, Hobson PR, Reid ID. Computationally efficient algorithms for the twodimensional KolmogorovSmirnov test. J Phys Conf Ser. 2008; 19(4).Google Scholar
 20.Friedman JH, Rafsky LC. Multivariate generalizations of the WaldWolfowitz and Smirnov twosample tests. Ann Stat. 1979; 7(4):697–717.CrossRefGoogle Scholar
 21.Rosenbaum PR. An exact distributionfree test comparing two multivariate distributions based on adjacency. J R Stat Soc Ser B. 2005; 67(4):515–30.CrossRefGoogle Scholar
 22.Cheriton D, Tarjan R. Finding minimum spanning trees. SIAM J Comput. 2006; 5(4):724–42.CrossRefGoogle Scholar
 23.Zhang Q, Mahdi G, Chen H. A graphbased multisample test for identifying pathways associated with cancer progression. Technical Report. 2017.Google Scholar
 24.Meyer PE. 2014. R package infotheo: https://cran.rproject.org/web/packages/infotheo/infotheo.pdf.
 25.Zhang X, Zhao X, He K, Lu L, Cao Y, et al.Inferring gene regulatory networks from gene expression data by path consistency algorithm based on conditional mutual information. Bioinformatics. 2012; 28(1):98–104.CrossRefPubMedGoogle Scholar
 26.Zhang Q, Burdette JE, Wang JP. Integrative network analysis of TCGA data for ovarian cancer. BMC Syst Biol. 2014; 8(1338):1–18.Google Scholar
 27.Anders S, Pyl PT, Huber W. HTSeq  a Python framework to work with highthroughput sequencing data. Bioinformatics. 2015; 31:166–9.CrossRefPubMedGoogle Scholar
 28.Hsu F, Serpedin E, Hsiao T, Bishop A, Dougherty E, et al.Reducing confounding and suppression effects in tcga data: an integrated analysis of chemotherapy response in ovarian cancer. BMC Genomics. 2012; 13(S13).Google Scholar
 29.Liu MC, Pitcher BN, Mardis ER, Davies SR, Friedman PN, et al.PAM50 gene signatures and breast cancer prognosis with adjuvant anthracycline and taxanebased chemotherapy: correlative analysis of C9741. Breast Cancer. 2016; 2(15023).Google Scholar
 30.HaibeKains B, Desmedt C, Loi S, Culhane AC, Bontempi G, et al.A threegene model to robustly identify breast cancer molecular subtypes. J Natl Cancer Inst. 2012; 104(4):311–25.CrossRefPubMedPubMedCentralGoogle Scholar
 31.Gendoo DM, Ratanasirigulchai N, Schroder MS, Pare L, Parker JS, et al.Genefu: an R/Bioconductor package for computation of gene expressionbased signatures in breast cancer. Bioinformatics. 2016; 32(7):1097–9.CrossRefPubMedGoogle Scholar
 32.Qiu X, Wu H, Hu R. The impact of quantile and rank normalization procedures on the testing power of gene differential expression analysis. BMC Bioinformatics. 2013; 14(124).Google Scholar
 33.Massague J. TGFbeta in cancer. Cell. 2008; 134(2):215–30.CrossRefPubMedPubMedCentralGoogle Scholar
 34.Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B. 1995; 57(1):289–300.Google Scholar
 35.Curtis C, Shah S, Chin S, Turashvili G, Rueda O, et al.The genomic and transcriptomic architecture of 2000 breast tumours reveals novel subgroups. Nature. 2012; 486:346–52.CrossRefPubMedPubMedCentralGoogle Scholar
 36.Zhang J, Huang K. Pancancer analysis of frequent DNA comethylation patterns reveals consistent epigenetic landscape changes in multiple cancers. BMC Genomics. 2017; 18(1045).Google Scholar
 37.Wang F, Xu H, Zhao H, Gelernter J, Zhang H. DNA comethylation modules in postmortem prefrontal cortex tissues of European Australians with alcohol use disorders. Sci Rep. 2016; 6(19430).Google Scholar
 38.de Leeuw CA, Neale BM, Heskes T, Posthuma D. The statistical properties of geneset analysis. Nat Rev Genet. 2016; 17(6):353–64.CrossRefPubMedGoogle Scholar
 39.Mishra PP, Medlar A, Holm L, Toronen P. Robust multigroup gene set analysis with few replicates. BMC Bioinformatics. 2016; 17(526).Google Scholar
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.