Efficient identification of multiple pathways: RNA-Seq analysis of livers from 56Fe ion irradiated mice

Nia, Anna M.; Chen, Tianlong; Barnette, Brooke L.; Khanipov, Kamil; Ullrich, Robert L.; Bhavnani, Suresh K.; Emmett, Mark R.

doi:10.1186/s12859-020-3446-5

Efficient identification of multiple pathways: RNA-Seq analysis of livers from ⁵⁶Fe ion irradiated mice

Methodology article
Open access
Published: 20 March 2020

Volume 21, article number 118, (2020)
Cite this article

Download PDF

You have full access to this open access article

BMC Bioinformatics Aims and scope Submit manuscript

Efficient identification of multiple pathways: RNA-Seq analysis of livers from ⁵⁶Fe ion irradiated mice

Download PDF

Anna M. Nia¹,
Tianlong Chen²,
Brooke L. Barnette¹,
Kamil Khanipov³,
Robert L. Ullrich⁴,
Suresh K. Bhavnani² &
…
Mark R. Emmett^1,3,5

4933 Accesses
8 Citations
1 Altmetric
Explore all metrics

Abstract

Background

mRNA interaction with other mRNAs and other signaling molecules determine different biological pathways and functions. Gene co-expression network analysis methods have been widely used to identify correlation patterns between genes in various biological contexts (e.g., cancer, mouse genetics, yeast genetics). A challenge remains to identify an optimal partition of the networks where the individual modules (clusters) are neither too small to make any general inferences, nor too large to be biologically interpretable. Clustering thresholds for identification of modules are not systematically determined and depend on user-settable parameters requiring optimization. The absence of systematic threshold determination may result in suboptimal module identification and a large number of unassigned features.

Results

In this study, we propose a new pipeline to perform gene co-expression network analysis. The proposed pipeline employs WGCNA, a software widely used to perform different aspects of gene co-expression network analysis, and Modularity Maximization algorithm, to analyze novel RNA-Seq data to understand the effects of low-dose ⁵⁶Fe ion irradiation on the formation of hepatocellular carcinoma in mice. The network results, along with experimental validation, show that using WGCNA combined with Modularity Maximization, provides a more biologically interpretable network in our dataset, than that obtainable using WGCNA alone. The proposed pipeline showed better performance than the existing clustering algorithm in WGCNA, and identified a module that was biologically validated by a mitochondrial complex I assay.

Conclusions

We present a pipeline that can reduce the problem of parameter selection that occurs with the existing algorithm in WGCNA, for applicable RNA-Seq datasets. This may assist in the future discovery of novel mRNA interactions, and elucidation of their potential downstream molecular effects.

Identification of Gene Communities in Liver Hepatocellular Carcinoma: An OffsetNMF-Based Integrative Technique

CEMiTool: a Bioconductor package for performing comprehensive modular co-expression analyses

Article Open access 20 February 2018

Robust normalization and transformation techniques for constructing gene coexpression networks from RNA-seq data

Article Open access 03 January 2022

Background

RNA-Seq, an approach to genome profiling that uses deep-sequencing technologies, has become an increasingly common technique to understand biological phenomena at the molecular level. This method generates quantitative count data on thousands of different mRNAs within each experiment. Comparing the expression of genes between different experimental conditions identifies hundreds of differentially expressed genes, but translating these lists into key functional distinctions between conditions has proved challenging. Since there are thousands of genes in each sample, many researchers filter their gene lists based on different criteria, in order to extract meaningful biological information. One such filtering criteria is based on differential gene expression analysis. Differential gene expression analysis has traditionally been used to determine genes that are statistically significantly differentially expressed between different experimental conditions based on different metrics, such as non-parametric generalized linear models, independent sample t-tests, and log₂ fold changes [1]. Even though differential gene expression analysis is one of the most common methods for identifying disease pathways in various experimental conditions, it does not take into consideration the interactions of genes that work as a system to coordinate cellular functions. As a result, using only differential gene expression analysis would limit mechanistic interpretations of the data. mRNAs never act in isolation, but rather in concert with each other and other signaling molecules to define a particular biological pathway and function. Interactions of these signaling molecules can be viewed as networks of interconnected genes and their partners, that are up/down regulated under certain chemical or environmental conditions.

Many algorithms that utilize network theory have found applications in identifying and analyzing these molecular interactions [2,3,4]. Correlation networks are an example of such algorithms, and describe the co-expression of many genes in response to changing conditions, which can ultimately provide information about the underlying molecular mechanisms or biochemical pathways [5, 6]. In particular, the Weighted Gene Co-expression Network Analysis (WGCNA) method, which is provided as an R software package, has been widely used for performing different aspects of weighted correlation network analysis [7]. The co-expression networks used in WGCNA are constructed based on correlations between the quantitative measurements of each gene, and can be described by an n x m matrix X = [x_ij]. Here the row indices (i = 1,…,n) correspond to different genes, and the column indices (j = 1,…,m) correspond to different sample measurements. While co-expression networks integrate systems-level information to provide a mechanistic interpretation of the dataset, detecting modules (clusters) of closely related mRNAs within the co-expression networks has been a challenging problem. Significant pathways that are identified by different clustering methods often yield tens or hundreds of genes, making biological interpretation and validation challenging. Further, many clustering techniques such as Dynamic Tree Cut utilized in WGCNA rely on user-settable parameters, including minimum module size, and are sensitive to cluster splitting [8, 9]. While many of these module detection methods perform optimally on some datasets, they may fail to effectively detect patterns in other datasets. A practical challenge in terms of discovering modules and determining the total number of modules is the identification of the optimal number of modules in the network, such that the individual modules are neither too large, preventing meaningful interpretation, nor too small, allowing little to no general inference. In general, characterizing and detecting community structures within networks has been a challenging problem in the study of networks [10,11,12]. One of the most commonly used metrics to investigate community structure is a quality index for clustering known as Modularity [13,14,15]. In spite of its popularity, Modularity does have drawbacks. The resolution limit (RL) problem is one of the most significant drawbacks, referring to the problem of maximizing Modularity while hindering one’s ability to detect communities that contain fewer links [16]. To address this problem, several approaches have been introduced [17,18,19,20]. Of these approaches, Modularity Maximization, which utilizes modularity density measures, has been shown to eliminate rather than merely reduce the RL problem in a wide range of networks [20].

In this study, we propose a pipeline using Modularity Maximization [20] to effectively detect and evaluate modules from co-expression networks obtained from the adjacency matrix, utilizing WGCNA [4, 7]. We employ the above technique to characterize the effects of ⁵⁶Fe irradiation on mice livers, in order to study the potential consequences of deep space travel. In particular, astronauts will be exposed to high-charge, high-energy ions (HZE) during deep space travel. Even at low doses, exposure to HZE can lead to cancer [21, 22]. However, the effects of ions found in the deep space environment on cancer formation is not well understood since very few people have been exposed to space irradiation. As human exploration into deep space increases in the future, characterization of and intervention in irradiation-induced diseases will become more important. Previous studies have shown that irradiation of mice with low-dose HZE, specifically ⁵⁶Fe ions, significantly increases the incidences of hepatocellular carcinoma (HCC) [23, 24]. HCC is the most common type of liver cancer, and its formation has mainly been studied in the context of terrestrial risk factors such as chronic hepatitis B/C virus infection, exposure to aflatoxin, obesity, smoking, and heavy alcohol consumption [25,26,27]. However, there is limited knowledge of the effects of low-dose ⁵⁶Fe ion irradiation on the formation of HCC. To better understand the molecular mechanisms of low-dose ⁵⁶Fe induced HCC, we used RNA-Seq to determine gene expression changes in the hepatic micro-environment of ⁵⁶Fe ion irradiated compared to non-irradiated control mice at 5 different time points post-irradiation. We hypothesized that mitochondrial pathways could be significantly affected, since mitochondria represent a substantial cellular target volume (4–25% depending on the cell) [28]. In this manuscript, we will describe how WGCNA can be integrated with Modularity Maximization to construct co-expression correlation networks of differentially expressed genes and detect modules using data obtained from RNA-Seq.

Results

Differential gene expression analysis

Results of differential gene expression analysis are shown in Table 1, which includes the total number of differentially expressed genes at each time point, as well as whether genes are up/down-regulated.

Table 1 Results of differential gene expression analysis of RNA-Seq data from ⁵⁶Fe Irradiated and non-Irradiated control mice livers at various time points analyzed using edgeR package.

Full size table

Feature selection

A total of 2273 unique differentially expressed genes were identified in comparison between ⁵⁶Fe irradiated and non-irradiated controls. Genes that were statistically significant with FDR ≤ 10^− 5 were used for downstream network analysis; 487 unique genes met the filtration criteria. The significance cut off can be adjusted to a higher value if a researcher decides to investigate more genes, depending on the study goals, experimental conditions, and data variability.

WGCNA

We initially used the WGCNA Dynamic Tree Cut algorithm [7] to identify modules within the selected differentially expressed genes. Module identification with this algorithm requires two parameters to be determined prior to network construction: deepSplit, and minClusterSize. deepSplit can be either logical or an integer in the range 0 to 4. It controls the sensitivity to cluster splitting. Higher values result in smaller clusters. minClusterSize represents the minimum number of genes needed in a module to be considered a separate module. Table 2 shows the results of WGCNA module identification using different minClusterSize values, with a default deepSplit value of 2. As minClusterSize increases, the total number of modules decreases. These values produce different types of networks with differing numbers of unassigned genes. If a gene does not belong to a specific module, it is assigned to the Grey/Unassigned Module. The number of unassigned genes varied between 36 and 73 (shown in the last column of Table 2). At the same time, the total number of modules needs to be within a reasonable range, in order to be able to meaningfully investigate the relationship between genes; 70 different modules each containing a few genes may not provide meaningful information about these co-expression patterns. In our dataset, networks with a total of 11–18 modules provided interpretable co-expression patterns for further investigation, using pathway analysis tools as well as experimental validation. However, these clustering parameters resulted in 61–69 unassigned genes, representing ~ 12–14% of the 487 selected highly significant features.

Table 2 WGCNA Results with Dynamic Tree Cut Algorithm: deepSplit provides a rough control over the sensitivity to cluster splitting. The higher the value (or if TRUE), the more and smaller clusters will be produced. The Dynamic Tree Cut may identify modules whose expression profiles are very similar. The parameter minClusterSize allows one to control the minimum number of genes in a module, helping to avoid having similar clusters of few genes. As shown in the table, the lower values of minClusterSize increase the ‘Total Number of Modules’. Moreover, as this number increases, the ‘Number of Genes in Unassigned Module (Grey)’ increases as well

Full size table

WGCNA with Modularity Maximization

To optimize the number and size of identified modules as well as reduce the number of unassigned genes (~ 12–14%), we exploited the concept of Modularity Maximization, to assist in finding community structures, as an alternative to utilizing the Dynamic Tree Cut algorithm employed in the standard WGCNA pipeline. Dynamic Tree Cut relies on hierarchical clustering, which is based on the relative distance between genes and samples. Modules are detected by “cutting” these trees, which can lead to many different small modules or a few large modules, depending on the selection of the minClusterSize and deepSplit parameters. Using Modularity Maximization, we were able to identify modules without the need to set these parameters empirically. In particular, the adjacency matrix with a soft threshold beta of 16 (corresponding to R² = 0.9) was first computed using WGCNA, then a clustering algorithm based on Modularity Maximization was used to automatically find community structures in our dataset. We chose the Modularity Maximization method, since the metric of Modularity has been widely used to detect and assess community structures in social and biological networks since its inception [20, 29,30,31,32].

Utilizing the modularity-based clustering algorithm to identify modules, 14 modules were discovered, and only 14 individual genes were unassigned. The final modularity score was Q = 0.696, which is indicative of a strong modular structure in the network. Figure 1 depicts the 14 modules in the network, and Table 3 shows the number of genes included in each module along with the enriched molecular pathways, as discussed below.

Table 3 Ingenuity Pathway Analysis on individual modules

Full size table

Module validation and properties

To explore the biological relevance of the modules, each module was investigated by Ingenuity Pathway Analysis (IPA). Specifically, module 1 was shown to be significant (−log₁₀(p-value) ≥ 1.3) in mitochondrial pathways, such as the Sirtuin Signaling Pathway, Mitochondrial Dysfunction, and Oxidative Phosphorylation [31]. All the genes involved in mitochondrial dysfunction in our dataset were contained in module 1. In particular, Fig. 2 shows that 5 of these genes express different subunits of mitochondrial complex I and III. To validate these results, we performed an additional experimental technique to determine whether complex I activity is reduced in response to ⁵⁶Fe irradiation. Complex I activity was observed to be decreased in response to exposure to ⁵⁶Fe HZE ion across all time points as measured by mitochondrial complex I enzyme activity (Fig. 3). The downstream effects of irradiation on mitochondrial functions have been emphasized [33], as mitochondria have been shown to occupy a substantial fraction of the cell volume [28]. Therefore, they may be fairly easily targeted by irradiation as the ⁵⁶Fe nuclei traverse the cell. The electron transport chain in the mitochondrion is composed of five protein complexes (I-V) that perform a series of oxidation-reduction reactions, in which O₂ is the final electron acceptor and is reduced to a water molecule. One of the consequences of this process is the formation of reactive oxygen species (ROS), which is thought to arise from the leakage of electrons, specifically from complex I and III, and to a minor extent complex II [34,35,36]. Using oxygen as the final electron acceptor causes mitochondria to consume about 90% of the body’s oxygen but also become the richest source of ROS [36,37,38,39]. The upregulation of mitochondrial genes shown in Fig. 2, specifically in complex I and complex III, suggests that leakage of electrons from these two complexes results in increased Complex I and III enzyme activity. This leads to further the overexpression of these genes in response to ⁵⁶Fe irradiation. Other modules could potentially be validated in future experimental designs, involving live animals and more fresh tissues. For example, module 2 can be tested for JAK/STAT signaling. STATs are ubiquitously expressed and mainly activated after stimulation of cytokine receptors. STATs function in the nucleus, but they are first activated in the cytoplasm and have then to be transported into the nuclear compartment [40]. This translocation can be assessed by indirect immunofluorescence. Additionally, STAT signaling can be experimentally validated by pharmacologically inhibiting STAT pathways with specific STAT inhibitors. Similarly, module 9 could be tested for Endoplasmic Reticulum (ER) stress pathways. Several molecular indicators of ER stress could be examined by Western Blots and/or proteomic analysis, which could demonstrate increased or decreased phosphorylation of ER stress proteins.

Discussion

One of the current statistical challenges in identifying co-expression patterns in RNA-Seq data is a robust determination of the number and size of modules appropriate, across a variety of datasets. The choice of an appropriate clustering algorithm that yields the most biologically interpretable networks has been studied using different datasets and methods. For example, a study has investigated Recursive Indirect-Paths Modularity (RIP-M) for module detection in an RNA-Seq co-expression network. Using an influenza vaccine response study, the authors showed that RIP-M had higher cluster assignment accuracy as compared to Newman Modularity, and similar results to WGCNA [41]. We compared WGCNA, RIP-M, and the combined WGCNA-M method based on the Rand Index (RI) [29]. In calculating the RIs, we considered every unassigned gene, as a cluster by itself, since such genes are viewed as not being similar to each other. The RIs for WGCNA-M versus WGCNA, WGCNA-M versus RIP-M, and WGCNA versus RIP-M were 0.909, 0.892, 0.936, respectively. The numbers of genes unassigned to a cluster for WGCNA-M, WGCNA, and RIP-M were 14, 108, and 0, respectively. All of the 14 genes which were unassigned in WGCNA-M were also unassigned in WGCNA. Based on our observed RIs, RIP-M and WGCNA were the most similar when applied to our dataset. Like WGCNA, RIP-M also requires the parameter minModuleSize (minClusterSize) as well as an additional parameter, maxModuleSize, which specifies a target range of module sizes. All genes assigned to a module below minModuleSize are then grouped together and merged into a further module. Modules above maxModuleSize are split in subsequent iterations to arrive at the target range. RIP-M forces all genes to be assigned to a cluster; the 14 genes unassigned by WGCNA and the WGCNA-M approach for our dataset were placed into cluster 1 by RIP-M. Community detection method selection is an important issue in cluster analysis and may greatly influence the results of a study and their biological interpretability. Therefore, it is imperative to select the most suitable method for each specific experimental design, and for the nature of the data being investigated. A complete list of gene cluster-assignments for each method is provided in Supplemental Table 1.

Utilizing Modularity Maximization to detect community structures provides an additional way to construct a network and explore various RNA-Seq datasets; however, WGCNA-M is not limited to this application domain, and can be applied to detect co-expression patterns amongst other omics studies. Protein or lipids can be linked together in networks via a defined functional relationship in a similar fashion. Methodologically, MS-based proteomics and lipidomics tend to have consistency, and coverage issues [42,43,44,45,46,47] as compared to RNA-based high throughput methods. As a result, some network analysis methods as applied to proteomic data may not capture the complexity and nuances underlying biological processes, and alternative approaches may be needed to complement the existing analytical tools. Similar to any other analytical method, the network-based WGCNA-M analysis method must be applied appropriately based on the inherent quality and nature of each dataset. This will allow us to gain robust biological insight and decipher the unique patterns in our data from which we can further understand the complexity and coordinated function of the system being investigated. In the current study, utilization of WGCNA with Modularity Maximization resulted in the identification of biologically interpretable and relevant modules, without the need for parameter optimization.

Conclusions

In this study, we proposed a new pipeline that combines the adjacency matrix notion of WGCNA with Modularity Maximization to find modules that are involved in specific biological pathways. To show the validity of the identified modules, we conducted gene enrichment analysis and experimental validation. Our results showed that mitochondrial pathways that were changed in response to irradiation were contained in the same module. Further, our data indicates that even after performing stringent feature selection focusing on significant genes (FDR ≤ 10^− 5), WGCNA-M was still able to identify biologically relevant modules. The use of the WGCNA Dynamic Tree Cut clustering algorithm in our dataset resulted in a high number of unassigned genes (61–69). On the other hand, WGCNA-M reduced the number of unassigned genes to 14 while maintaining an optimal number of modules/specific pathways. The proposed pipeline enables the identification of network and community structures without requiring optimization of the minClusterSize and deepSplit parameters. The increasing number of high throughput genomic datasets, together with the use of appropriate network pipelines, will enable researchers to efficiently investigate molecular mechanisms and pathways involved in different disease processes.

Methods

In this section, we describe the WGCNA combined with Modularity Maximization for community detection pipeline used in the RNA-Seq dataset. Our evaluation strategy was targeted to analyze data from an RNA-Seq experiment of ⁵⁶Fe irradiated and non-irradiated control mice liver lobes, designed to characterize the microenvironmental changes induced by HZE irradiation (similar to HZE ions encountered in deep space travel,) and that lead to induction of HCC. Our aim is to detect modules (clusters of genes) that are related by correlation across samples, and differ between experimental conditions. The resulting co-expression networks were analyzed using functional enrichment analysis and experimentally validated.

Animal experiments and sample preparation

C57BL/6NCrl mice purchased from Charles River (Wilmington, MA) were used in this experiment. Tumor induction studies and studies of molecular changes in the irradiated tissues can only be conducted in whole animals. Further, based on an extensive literature search and examination of studies previously approved by the institutional animal care and use committees (IACUCs), computer models or cell culture studies are not possible. The numbers of animals used were based on the expected numbers of irradiation-related tumors that would develop if animals were allowed to live out their lifespans. Power calculations for numbers in this study are based on the chi-square test for comparing two proportions, with a two-sided significance level set at 0.05 at 80% power.

The serial sacrifice study included 15 male mice with 3 mice per time point at five time-points (30, 60, 120, 270, and 360 days) post-exposure to HZE, specifically ⁵⁶Fe irradiation. Additionally, 15 mice were used as controls, at the same time points, resulting in a total of 30 mice for this study. The two groups were: 600 Me V/n ⁵⁶Fe (0.2 Gy) and non-irradiated/sham-irradiated control. The mice were shipped from the vendor to Brookhaven National Laboratories (BNL) and housed at the BNL animal facility until the time of irradiation at the NASA Space Radiation Laboratory (NSRL). Following irradiation, the animals were shipped to the UTMB Animal Resources Center (ARC), quarantined for 1 month, and maintained in the ARC for the duration of the experiment. Animals were housed in sterile cages with free access to food and water. Facilities at both BNL and UTMB are fully AAALAC accredited, ensuring adequacy of animal care at both animal facilities.

At each of the five time-points, 3 animals from each group were randomly selected and euthanized using CO₂ asphyxiation, as per current AVMA guidelines for euthanasia. Prior to euthanasia, animals were weighed and weights recorded. Post euthanasia, tissues of the left lobe of livers were collected, snap-frozen on either dry ice or liquid nitrogen, and stored at − 80 °C until tissues could be extracted for RNA analysis. Livers were sampled by taking two 40-μm thick slices using a cryotome at − 20 °C.

Acquisition of RNA-Seq data

Total RNA was isolated from the liver slices using RNAqueous™ Total RNA Isolation Kit (ThermoFisher Scientific, Waltham, MA), and rRNA was removed using the Ribo-Zero™ rRNA Removal Kit (Illumina, San Diego, CA), prior to library preparation with the Illumina TruSeq RNA Library kit. Samples were sequenced in a paired-end 50 base format on an Illumina HiSeq 1500. FastQC was utilized for the quality evaluation of FASTQ files [48]. All FastQC reports were examined prior to the analysis of RNA-Seq samples. The total number of reads used in analysis varied between 23 and 35 million. A complete list of samples, and related reads information is available in Table 4.

Table 4 Sample List and Total Reads

Full size table

Reads were aligned to the mouse GRCm38 reference genome using the STAR alignment program, version 2.5.3a, with the recommended ENCODE options [49]. The-quantMode GeneCounts option was used to obtain read counts per gene based on the Gencode release M14 annotation file [50].