Background

The RNA-sequencing (RNA-Seq) leverages the rapid breakthroughs of the next-generation sequencing platform for profiling high-quality gene expression. Over the span of years, the RNA-Seq has emerged as an alternative to other gold standard techniques in transcriptomes [1, 2]. In contrast to microarrays, RNA-Seq achieves a higher resolution and lower technical variability [2,3,4] which leads to a higher reproducibility [5]. Another advantage of RNA-Seq relative to previously developed transcriptomic sequencing methods is that it has the ability to track transcriptomic dynamics (or gene expression changes) of tissues during physiological changes [5, 6], which thus allows a comparison of biological samples from patients with or without a specific disease or condition.

In response to these advantages, a vast number of statistical methods have become available to elucidate the genes or biological pathways associated with biological conditions or health outcomes, such as differential expression (DE) analysis [7, 8] and pathway enrichment (PE) analysis [9,10,11] of read counts (or gene expression) of an RNA-Seq data. However, it can be argued that results of DE analysis may provide limited information with the increased evidence that genes work in conjunction each other [12, 13]. The PE analysis appears to be a useful complement to the analysis of DE. The fundamental hypothesis in a PE analysis is that genes are regulated under common biological processes and clustered as a ‘pathway’ [13, 14], which borrows a priori pathway knowledge from the public repositories, namely, Gene Ontology [15], Kyoto Encyclopedia of Genes and Genomes [16], or Reactome [17]. To put it another way, PE analysis is primarily restricted to its use in a reference collection of well-studied biological processes only. Thereby, the idea of ‘network’ is introduced to pursue the veiled information that are obscured in those well-defined pathways [18].

The differential network (DN) analysis provides novel insights for identifying changes in gene-gene interactions under different biological conditions [19]. In theory, such changes are assessed through a comparison in characteristics of a network structure (i.e., network topology) between two or more networks that are perturbed by a specific biological condition such as the development of cancer.

Despite the growing popularity, none of existing methods [20,21,22] fully addresses how to adjust for additional covariates (e.g. patient-age, patient-reported family histories, and other comorbidities) that may be associated with network topology.

Recently, we have adopted a pseudo-value regression [23] that allows covariate adjustment for the DN analysis while maintaining a high precision and recall values via a Monte Carlo simulation comparing with other methods available in R packages such as DINGO [20] and dnapath [22]. In addition, the computation time of this approach was shown competitive. To date, this is the first attempt of statistical method for the DN analysis with the inclusion of additional covariates.

In this article, we describe the software built as an R package, namely PRANA (Pseudo-value Regression Approach in Network Analysis). PRANA is tailored to incorporate additional covariates information that may be associated with measures of connectivity of a gene (i.e. centrality) and a binary group indicator. This differs from previous statistical framework (or softwares) in DN analysis such as dnapath and DINGO.

Implementation

Algorithm

The algorithm below summarizes how the pseudo-value regression approach is embedded in a function named with PRANA. Briefly, the association measures are marginal quantities, such as degree centralities of each gene. Through the use of jackknife pseudo-values [24], we find the contribution of each individual data point to these quantities. Therefore, we could regress them on additional covariates as shown in studies with multi-state survival data [25, 26]. More details on methodological aspects are fully described elsewhere [23].

Algorithm 1 PRANA: Pseudo-value Regression Approach in Network Analysis

Details of functions in PRANA

Main function

The main R function to perform the pseudo-value regression for the DN analysis with additional covariates is PRANA. The PRANA function imports two R scripts for the calculation of (1) total connectivity of an association matrix estimated from an observed expression data (as in thetahats function) and (2) adjusted p-values with the empirical Bayes screening procedure (as in EBS function) [28]. A list of three data.frame objects (coefficient estimates, p-values, and adjusted p-values of each predictor variable included in the regression for each gene) are returned upon the execution of PRANA function.

Supporting functions

For user convenience, we provide six additional R functions for extracting adjusted p-values (adjpval, adjpval_specific_var), coefficient estimates (coeff, coeff_specific_var), and genes that are significantly DC (sigDCGnames, sigDCGtab) from the output from PRANA function.

Dependencies

The PRANA package is fully implemented in R statistical programming language. The package depends on the base R packages (parallel, stats) and other R packages from the Comprehensive R Archive Network library (CRAN; dnapath, dplyr, robustbase) and Bioconductor (minet). Of important note, minet package should be directly installed from Bioconductor for a full utilization of PRANA package. This can be done by executing the code below in the R console.

figure b

Results

In this section, we illustrate how PRANA can be applied in practice using the sample dataset available from the package. This case study is the same as the one analyzed in our methodology paper [23].

Sample dataset

The PRANA package includes a sample dataset named combinedCOPDdat_RGO with 406 samples. combinedCOPDdat_RGO consists of an RNA-Seq expression data for 28 genes that were spotlighted as associated with the chronic obstructive pulmonary disease (COPD) from a genome-wide association study [29]. It is a subset of the original study stored in the Gene Expression Omnibus (GEO) database with the accession number GSE158699 [30]. In this sample dataset, a phenotype data on six clinical and demographic variables is also available: current smoking status (main grouping variable), smoking pack years, age, gender, race, and FEV1 percent. The user can call the sample data into R by executing the following code:

figure c

Alternatively, the user can also assign the data to an object by running the code below:

figure d

Data processing

The PRANA function requires a user to provide each expression and phenotype data separately.

figure e

The main predictor variable in this example analysis is the current smoking status. As discussed in the Algorithm subsection, the estimation of association matrices (or networks) and the calculation of jackknife pseudo-values are carried out for each group separately. Hence, we add another step that locates the indices of subjects from each ‘current’ vs. ‘non-current smokers’ group. These indices are used to dichotomize expression dataset into ‘non-current smokers (Group A)’ and ‘current (Group B).’

figure f

Apply PRANA function for DN analysis with additional covariates

Once the data processing is complete, a user can perform a DN analysis with additional covariates. PRANA function takes an expression and phenotype data, separately, in which a user specifies each for RNASeqdat and clindat, respectively. To be more specific, the variables included in phenotype data are included in the regression. In addition, the group-specific indices for the main binary indicator variable are provided as groupA and groupB within the function.

figure g

The output of the PRANA function is a list containing three data.frame objects for coefficient estimates, p-values, and adjusted p-values of all covariates included in the fitted model for each gene. Results are shown as following:

figure h

Some supporting functions

The package offers some auxiliary features. A user can get a table of adjusted p-values and coefficient estimates for all variables with adjptab and coeff functions as following:

figure i

Suppose, for instance, we are interested in looking at the adjusted p-values for the current smoking status variable instead of a table with all variables. adjpval_specific_var function is available for that purpose:

figure j

Similarly, coeff_specific_var function can be executed to return a coefficient estimate for a specific variable (current smoking status in the example below). A cautionary note is that the user must provide the name of a variable as in varname within each adjpval_specific_var or coeff_specific_var functions.

figure k

Additionally, sigDCGtab and sigDCGnames functions take a data.frame object as an input, defined by adjpval function earlier, to output the names of DC genes (i.e. NCBI Entrez gene IDs in the first column) for the main binary grouping variable utilized for the DN analysis, as well as corresponding adjusted p-values. sigDCGnames returns the names of DC genes only. A user may adjust the level of significance (alpha), which is set to 0.05 by default. Please see the following commands below:

figure l

As a result, PRANA identified 23 genes that are significantly DC between current and non-current smokers while accounting for additional covariates such as smoking pack years, age, gender, race, and FEV1 percent.

As an additional step, a user can utilize rename_genes function from the dependency package (dnapath) to rename results with Entrez gene IDs into gene symbols. See below for the demonstration in R console. Results are summarized in Table 1.

figure m
Table 1 Results of DC genes obtained from PRANA. The sample dataset contains the NCBI Entrez gene IDs, so does the resulted DC genes (first column). dnapath::rename_genes is utilized to rename Entrez gene IDs to gene symbol (second column)

Discussion

The R package PRANA has been published in the CRAN (https://cran.r-project.org/web/packages/PRANA/index.html). This package has no operating system dependencies. A vignette is available on this package at https://cran.r-project.org/web/packages/PRANA/vignettes/UserManualPRANA.html or can be accessed by typing in an R console (browseVignettes(package="PRANA")). In this package, the sample dataset is provided with COPD-related genes, as well as clinical and demographic variables. The source code of the package can be found in GitHub: https://github.com/sjahnn/PRANA.

PRANA has some plans for future development. Firstly, although a user may attempt a classical regression-based variable selection such as stepwise selection, we have not yet validated this through a statistical simulation experiment. Secondly, the names of genes provided in the sample dataset are Entrez gene IDs. Further extension will include a function that convert from these gene IDS to gene symbols (i.e. 10370 to CITED2) and vice versa for user convenience.

In conclusion, PRANA is a user-friendly and novel regression-based method that accounts for additional covariates along with the main binary grouping variable for the DN analysis.

Conclusions

The differential network analysis identifies changes in measures of associations between genes under different biological conditions. Although there has been increasing volume of work in this subject, overall covariate adjustment remains underexplored. In this paper, we present PRANA, the first R package that adjusts for additional covariates for the differential network analysis. As a brief note on the usage, PRANA takes RNA-sequencing and phenotype data (metadata) as inputs and in return tables containing DC gene names and their corresponding adjusted p-values are produced for a main binary grouping variable to be adjusted with the presence of additional covariates. This software is easy to install and user-friendly.

Availability and requirements

Project name: PRANA

Project home page: https://cran.r-project.org/web/packages/PRANA/index.html

Operating system(s): Platform independent

Programming language: R

Other requirements: Install dnapath, dplyr, robustbase, and minetR packages

License: GNU GPL-3

Any restrictions to use by non-academics: No restrictions