Background

High-throughput transcriptional profiling techniques (microarrays or, increasingly, RNAseq) are widely used to inform modern life-science research. Such techniques provide a molecular “camera” taking genome-wide “snapshots” of genetic activity. However, the effective analysis of microarray data presents a number of challenges, in particular handling the large number of genes that are studied simultaneously.

Analysing gene expression in the context of curated knowledge, or “knowledge base-driven pathway analysis”, is critical as this guides the reduction in search space from many thousands of genes to a subset of biological processes, which are much more tractable to human interpretation [1]. According to Khatri et al. [2], pathway enrichment approaches can be divided into three generations:

  i. Over-representation Analysis (ORA): This scores a pathway by considering the proportion of differentially-expressed genes (DEGs) observed in each pathway relative to the proportion of all microarray DEGs. This is used by several pathway analysis tools, including GenMAPP [3], GoMiner [4], Onto-Express [5] and FatiGo [6].

  ii. Functional Class Scoring (FCS): FCS gives a score to each gene in a pathway based on its expression, from which a pathway-score is calculated based on the scores of all the genes in the pathway. A number of FCS methods have been implemented through standalone tools such as GSEA [7], SigPathway [8], and SAFE [9], or web tools such as T-profiler [10], Gazer [11] and GeneTrail [12].

  iii. Pathway Topology (PT)-based approaches: These approaches exploit the topology of pathways by giving weights to pre-defined connections between genes, which inform pathway scoring. Several topology-based approaches have been described in the literature over the past few years. According to Mitrea et al. [13], PT-based approaches differ in the way they translate pathway topology information into a pathway score. Some methods use only the topology data of DEGs in the enrichment score (for example MetaCore [14] and EnrichNet [15]), whereas others (including SPIA [16] and GANPA [17]) use expression data of DEGs along with the topology data. Alternatively, some methods use expression data derived from all microarray genes, whether they change between conditions or not, for example PathOlogist [18], DEGraph [19], and ACST [20]. Importantly, some PT-based tools use only signalling pathway descriptions, such as Pathway-Express [21], NetGSA [22], ScorePAGE [23], TAPPA [24], MetPA [25], and Clipper [26].
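To make the first-generation (ORA) idea concrete, the following is a minimal sketch of an over-representation test based on the hypergeometric distribution, which is one common way of comparing the proportion of DEGs in a pathway with the proportion of DEGs on the whole array. The function name and arguments are illustrative assumptions and are not taken from any of the tools cited above.

```python
from math import comb


def ora_p_value(n_genes, n_degs, pathway_size, pathway_degs):
    """Upper-tail hypergeometric probability P(X >= pathway_degs).

    n_genes      -- genes measured on the array (assumed background)
    n_degs       -- DEGs among those genes
    pathway_size -- measured genes that belong to the pathway
    pathway_degs -- DEGs that belong to the pathway
    """
    max_hits = min(n_degs, pathway_size)
    total = comb(n_genes, pathway_size)
    # Sum the probabilities of seeing pathway_degs or more DEGs in the pathway.
    tail = sum(
        comb(n_degs, k) * comb(n_genes - n_degs, pathway_size - k)
        for k in range(pathway_degs, max_hits + 1)
    )
    return tail / total
```

A small p-value indicates that the pathway contains more DEGs than expected by chance; note that this calculation ignores both the magnitude of expression changes and the pathway topology.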

Previously, we proposed a new pathway enrichment method, in which both pathway topology and the magnitude of gene expression changes informed the creation of a Pathway Regulation Score (PRS) [27]. Specifically, by combining fold-change data for those transcripts exceeding a significance threshold, and by taking into account the potential of altered gene expression to impact upon downstream transcription, we identified those pathways most relevant to the pathophysiological process under investigation. Our approach addressed a number of issues that potentially compromise enrichment methods. We took steps to mitigate the influence of errors in ID mapping, and to reduce the bias introduced by highly-redundant pathways (i.e. multiple instances of the same gene). Topology methods also have to handle loops effectively, so we used a search algorithm derived from graph theory to resolve this problem. We also felt that arbitrarily dividing processes into either up- or down-regulated was artificial as changes in gene expression are likely to be distributed throughout pathways, thus ours was an overall impact assessment.

Herein, we describe the implementation of our PRS approach as a standalone tool that provides end users with the option of importing data from different microarray platforms and species. The tool yields both PRS and z-scores, provides statistical analysis, and allows browsing of pathways with impacted genes highlighted in different colours. An enhancement from our original report is that users are able to enrich both signalling and metabolic pathways.

Implementation

The PRS approach was implemented in MATLAB. Users without access to the MATLAB environment can download the MATLAB Compiler Runtime (MCR) in order to deploy the software described herein, via a user-friendly GUI. The PRS interface (Figure 1) provides users with several functions:

Figure 1. The PRS user interface showing analysis of a sample dataset.

Preprocessing microarray data

We did not re-engineer a filter to normalise data from a variety of platforms; rather, users must first preprocess transcriptomic data using one of the myriad existing tools. Data must be in the form of a simple Excel spreadsheet, in which the first column should be probe ID, and the following columns normalised replicated expression values from the control and test conditions. Additional information regarding species, sample numbers, fold-change and t-test thresholds, normalisation method and platform is required.
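As an illustration of the input layout described above, the sketch below computes a fold-change per probe from rows in which the control columns precede the test columns. It assumes that the normalised values are on a log2 scale and that the fold-change threshold is applied symmetrically to up- and down-regulated genes; both are assumptions made for this example, as are all function names. The t-test filter is omitted for brevity.

```python
from statistics import mean


def fold_changes(rows, n_control):
    """Return {probe_id: fold-change} for rows of (probe_id, values).

    The first n_control values are control replicates and the remainder are
    test replicates. Values are assumed to be log2-scale intensities.
    """
    result = {}
    for probe_id, values in rows:
        control, test = values[:n_control], values[n_control:]
        log2_fc = mean(test) - mean(control)
        result[probe_id] = 2 ** log2_fc
    return result


def passes_threshold(fold_change, threshold=1.3):
    """True if the gene changes by at least `threshold` in either direction (assumed rule)."""
    return max(fold_change, 1 / fold_change) >= threshold
```

For example, a probe whose mean log2 intensity rises from 1.0 in controls to 2.0 in the test group has a fold-change of 2 and would pass a threshold of 1.3.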

Pathway representation

Our fundamental algorithm was described previously [27]. Briefly, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway definitions [28] were used, in which pathways are maintained in KEGG Markup Language (KGML) format. We imported a total of 189 signalling and metabolic descriptions from KEGG and parsed these into MATLAB objects, which were then converted into directed graphs. KGML files contain three types of objects: entries, relations, and reactions. These can be mapped to graphical objects in the associated pathway map (Additional file 1). Only entries (which form nodes, represented as boxes) and relations (represented as edges) were used to represent signalling pathways where proteins (boxes) are linked by “relations”. All three types are used to represent the structure of metabolic pathways in order to capture substrate-enzyme-product relationships where enzymes (boxes) are linked by “relations”, and compounds (circles) are linked by “reactions”. To convert a metabolic pathway into a graph in a rational way, we represented enzymes as nodes in the graph, while substrates and products were used to detect the direction of relations (edges) between nodes (Figure 2). While we acknowledge that it is not possible to predict any effect on flux by this rationale, we reasoned that any change in node expression in a metabolic pathway could be of physiological relevance, particularly if nodes were connected.
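The conversion of a metabolic pathway into a directed graph can be sketched as follows. This is an illustrative reading of the description above, not the tool's MATLAB code: each reaction is assumed to be a record with one enzyme, a list of substrates and a list of products, and an edge is drawn from one enzyme to another when a product of the first is a substrate of the second. Reversible reactions are not handled, and the record layout is hypothetical.

```python
def enzyme_graph(reactions):
    """Build {enzyme: set of downstream enzymes} from reaction records.

    Each record is a dict with the (assumed) keys 'enzyme', 'substrates'
    and 'products'. Compounds are used only to orient the edges.
    """
    graph = {reaction["enzyme"]: set() for reaction in reactions}
    for upstream in reactions:
        for downstream in reactions:
            if upstream["enzyme"] == downstream["enzyme"]:
                continue  # avoid self-loops when an enzyme catalyses several reactions
            shared = set(upstream["products"]) & set(downstream["substrates"])
            if shared:
                graph[upstream["enzyme"]].add(downstream["enzyme"])
    return graph
```

Because enzymes are the dictionary keys, an enzyme that catalyses several reactions appears only once in the resulting graph.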

Figure 2. Example of the conversion of a group of reactions in a metabolic pathway (a) into a digraph (b) after removing redundancy.

Representing pathways as graphs had an additional advantage as it reduced redundancy, in that genes were only represented once in any pathway graph. A Depth-First Search (DFS) algorithm, derived from graph theory, was used to ensure that loops were only counted once.

Pathway scoring

Our method assigned weights to all significant nodes (i.e. DEGs) in a pathway to reflect their topological strength (specifically the number of significant downstream nodes that are pointed to, either directly or via other significant nodes as described previously [27]). A PRS was calculated on the basis of fold-change value and weighting of all significant nodes in the pathway and normalized for pathway size. We also calculated a z-score [29] (with an improvement over earlier implementations in that this was performed after removing redundant genes from pathway descriptions). The software outputs two lists of pathways ranked according to PRS and z-score, saved as both Excel and .mat files for later analysis.
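The weighting step can be illustrated with a short depth-first search. The sketch below counts, for each significant node, the significant nodes that can be reached from it directly or via other significant nodes, using a visited set so that loops are counted only once. The final combination (a weight of one plus the downstream count, multiplied by the absolute fold-change and divided by the number of nodes) is an assumption chosen for illustration; the exact PRS formula is given in the original report [27].

```python
def downstream_significant(graph, start, significant):
    """Count significant nodes reachable from `start` through significant nodes only."""
    visited = {start}
    stack = [start]
    count = 0
    while stack:
        node = stack.pop()
        for child in graph.get(node, ()):
            # Non-significant children block the path; visited nodes break loops.
            if child in significant and child not in visited:
                visited.add(child)
                count += 1
                stack.append(child)
    return count


def pathway_score(graph, fold_change, significant):
    """Illustrative score: sum of |fold-change| * weight, normalised by pathway size.

    `graph` maps every node to a list of its children. The weight
    (1 + downstream count) is an assumed form, not the published formula.
    """
    n_nodes = len(graph)
    if n_nodes == 0:
        return 0.0
    total = 0.0
    for node in graph:
        if node in significant:
            weight = 1 + downstream_significant(graph, node, significant)
            total += abs(fold_change[node]) * weight
    return total / n_nodes
```

In a pathway with edges A→B, B→A, B→C and C→D, where A, B and D are significant, nodes A and B each have one significant downstream node (the loop is not counted twice), and D is not reached because C is not significant.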

Pathway significance assessment

We then established the probability of achieving scores at least as high as the observed PRS by chance using a non-parametric permutation method. Initially, fold-change values for all expressed microarray genes were permuted. These values were then mapped back onto pathways, and a PRS recalculated. This process was repeated n times, where n is provided by the user through the interface (typically n = 1000). The statistical significance (p-value) of each pathway score was estimated by a comparison between the observed score and the n random scores generated. To achieve more reliable statistical significance evaluation, p-values were adjusted for multiple-test correction by a False Discovery Rate (FDR) method based on a threshold provided by the user. This is described in more detail in our original report [27].
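The permutation procedure described above can be sketched as follows. The scoring function is passed in as an argument so that the example stays independent of any particular PRS formula. Two details are assumptions made for this illustration rather than statements about the tool: the p-value uses the common (hits + 1)/(n + 1) estimator, and the FDR adjustment is the Benjamini-Hochberg procedure, since the text above does not name the specific FDR method.

```python
import random


def permutation_p_value(observed, score_fn, fold_changes, n_permutations=1000, seed=0):
    """Estimate P(random score >= observed) by shuffling fold-changes across genes."""
    rng = random.Random(seed)
    genes = list(fold_changes)
    values = [fold_changes[gene] for gene in genes]
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(values)
        permuted = dict(zip(genes, values))
        if score_fn(permuted) >= observed:
            hits += 1
    # Add-one estimator (assumed) so that the p-value is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted
```

Pathways whose adjusted p-value falls below the user-supplied threshold would then be reported as significant.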

Visualizing enriched pathways

After running the analysis, results are saved as .mat format files for ease of retrieval. By clicking on the pathway name from the list of ranked pathways shown in the table and selecting the option of visualizing a pathway from the interface, a marked-up pathway map will be displayed. Technically, the software will call a pathway mapping web service (REST-based API service) hosted on the KEGG website and pass a number of parameters, including a list of all expressed genes with their fold-changes and specified colours to differentiate DEGs from non-impacted genes. Figure 3 shows a typical pathway map where significant (i.e. above threshold) genes are coloured in red and non-significant (i.e. unchanged or not expressed) in green.

Figure 3. A typical marked-up pathway, in this case the KEGG “acute myeloid leukaemia pathway” enriched in an AML dataset (GEO accession #GSE9476); significant genes are coloured in red and non-significant ones in green.

UML for modelling and software description

Herein, we used Unified Modelling Language (UML) diagrams to describe, model and visualize the structure and functions of our method. There are 14 types of diagrams classified in three categories in UML 2.0 [30]; however, in this paper we used only two: class and sequence diagrams. Class diagrams represent static structures or main objects in the software. Figure 4 shows the key classes at the pathway analysis stage. The class “Analysis” is the main class, which provides an interface to run all the services provided by the tool. It has four main attributes:

  • MicroarrayObject: an object of the class “Microarray_Dataset” built by calling initialiseMicroarray() function (see Additional file 2). This holds the normalised gene expression data, and a list of all genes with their fold-change values.

  • kgmlObject: an object of the class “KGML_Parser” built by calling the parseKGML() function (see Additional file 3). This holds the static structure of all pathways as a list of objects of “KGML_Path” class that is defined by KGML format. An object of “KGML_Path” represents the structure of one KEGG pathway and is composed of entriesList, reactionsList, and relationsList (see Additional file 1).

  • PathList: this is a list of objects of the class “Pathway” which is created by calling CreatePathListFromKegg() function (see Additional file 4). This object ultimately holds a list of pathways enriched with reference to a given microarray dataset.

  • rankedPaths: this object is created by calling the rankPaths() function. It holds the same list of pathways defined by PathsList, but they are ranked in descending order based on PRS values.

Figure 4. UML class diagram illustrating the main classes of the package at the pathway analysis stage.

Sequence diagrams were used to represent the functions of the PRS tool according to different types of interactions between objects. As an example, Figure 5 represents the main PRS functions with the following steps:

  i. Conversion of pathways into graphs by the convertPath2Graph() function, which requires the usage of kgmlObject that holds a list of entries, relations and reactions of all pathways.

  ii. Using information stored in kgmlObject and PathsList for each graph (see Figure 4), a list of nodes is created (where each node represents one or more genes from the original pathway) and a list of children for each node.

  iii. Removal of redundant genes, which may be represented many times in the same pathway. Two functions are designed to deal with node redundancy: checkNodeRedundancy() and handleNodeRedundancy().

  iv. After building a graph for each pathway, graphs are weighted by calling the createWeightedGraphs() function, which uses the DFS algorithm to traverse the nodes of each graph and assign a weight for each significant node taking into account the loops in the graph.

  v. A pathway regulation score (PRS) is assigned to each weighted graph using the weights of the significant nodes in the graph and other parameters.
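The redundancy-removal step (iii) can be illustrated with a small sketch. It does not reproduce checkNodeRedundancy() or handleNodeRedundancy(), whose internals are not described here; instead it shows one plausible way to collapse several pathway entries that refer to the same gene into a single node while preserving their connections. For simplicity each entry is assumed to carry exactly one gene identifier, and the data layout is hypothetical.

```python
def merge_redundant_nodes(entries, relations):
    """Collapse entries that share a gene into one node.

    entries   -- dict mapping entry ID to gene ID (one gene per entry, assumed)
    relations -- list of (source entry ID, target entry ID) pairs
    Returns (nodes, edges): a sorted list of unique gene IDs and a set of
    (source gene, target gene) pairs with duplicates and self-loops removed.
    """
    nodes = sorted(set(entries.values()))
    edges = set()
    for source, target in relations:
        source_gene, target_gene = entries[source], entries[target]
        if source_gene != target_gene:
            edges.add((source_gene, target_gene))
    return nodes, edges
```

With this representation a gene that is drawn several times on a KEGG map contributes only once to the node list, so it cannot inflate a pathway score.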

Figure 5. UML sequence diagram illustrating PRS calculation and pathway ranking.

We implemented all these classes and functions, and the DFS algorithm, using MATLAB R2010a.

Results and discussion

The objective evaluation of novel enrichment analysis methods is difficult, relying on their ability to discern biological processes already known to be perturbed in disease states. We and others previously attempted this by studying performance across a range of datasets derived from distinct conditions ([27] and references therein). Having extended our algorithm to include biochemical pathways, we performed further analysis on a dataset describing a common metabolic disorder, that of type 2 diabetes mellitus (T2DM). The data were originally created by Taneera et al. [31], who compared gene expression levels in RNA isolated from human pancreatic islets taken from 9 type 2 diabetes (T2D) cadaver donors with RNA samples of pancreatic islets derived from 54 non-diabetic cadaver donors. These were hybridised to Affymetrix Human Gene 1.0 ST Arrays, and resulting expression values normalised by Robust Multi-array Analysis (RMA) before being uploaded to the Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo; accession #GDS4337). We created an input file containing Affymetrix probe IDs and normalized gene expression data for each of the 63 samples. Other parameters required were sample numbers in each group (9 in group1, 54 in group2 in this case), and fold-change and p-value threshold values to filter significant genes (in this case fold-change ≥1.3 and p-value <0.05). Fold-change thresholds are arbitrary, and the value selected in this example yielded a sufficient number of impacted genes to allow pathway mapping (in this example, a threshold of 1.5 would have yielded only 88 DEGs). The user can opt to enrich for signalling or metabolic pathways, or both (as in this example). Additional statistical testing can be performed, if required, by our permutation method (in this example we used number of permutations = 1000 and p-value threshold = 0.05).

Tables 1 and 2 display the top ten pathways ranked according to PRS and z-scores respectively, where only significant pathways (FDR < 0.05) were selected. A number of processes relevant to T2DM were picked up by both techniques, notably metabolic pathways such as “Arachidonic acid metabolism” [32] and “Fatty acid metabolism” [33],[34], as well as anticipated signalling processes such as “PPAR signalling pathway” [35],[36]. Both techniques detect “Pathways in cancer”, which is unsurprising as this description encompasses a number of processes perturbed in diabetes including apoptosis and the cell cycle, along with TGF-beta signalling [37]. “Complement and coagulation cascades” scored highly with both methods, which could be a false positive or may reflect alterations to the vasculature in diabetic islets. Apart from this exception, all other high-scoring PRS pathways are known to be impacted in diabetic states. Conversely, a number of pathways detected by z-scoring are harder to explain, and so may also be false positives (“Intestinal immune network”, “Cell adhesion molecules”, “Allograft rejection”, “Staphylococcus aureus infection”). Finally, the PRS method afforded greater prominence to two pathways critical to T2DM, “MAPK signalling” [38] and “Type II diabetes mellitus” [39], compared to z-scoring. Indeed, the latter description explicitly reflects the impact on adipocytokine and insulin signalling, which are central to the pathophysiology of diabetes.

Table 1 Top ten pathways ranked by PRS (T2D and pancreatic islets dataset)
Table 2 Top ten pathways ranked by Z-score (T2D and pancreatic islets dataset)

Conclusions

The rapid development of high-throughput genomic technologies and the deposition of their output in open-access databases has produced huge amounts of biological data. Mining and interpreting these data has driven innovation in the field of computational biology, leading to the emergence of sophisticated tools to produce reliable, meaningful and testable results. This is important as these kinds of experiments are expensive, and new tools are likely to add value to pre-existing analysis.

In this paper, we address two areas: firstly, the extension of our PRS enrichment algorithm [27] to include both metabolic and signalling pathways; and secondly, the detailed description of a GUI that facilitates array analysis by both PRS and z-scoring. The improved tool handles a number of challenges, notably in ID mapping, redundancy in pathway descriptions and statistical significance assessment. Unlike z-scoring, the PRS algorithm takes into account the topology of a pathway (the relationships between genes) and the magnitude of gene expression changes to identify impacted pathways. For these reasons, we argue that PRS enrichment yields more biologically-relevant insights compared to those provided by the standard hypergeometric method. It was not feasible to compare performance to other PT methods as the additional preprocessing steps taken to reduce redundancy in KEGG descriptions are not easily implemented in other methods without considerable re-engineering. The behaviour of signalling and metabolic pathways is, of course, distinct. However, as our approach was to assess transcriptional changes in a pathway, rather than to predict an effect on the function of a pathway, we felt it was reasonable to evaluate impact on signalling and biochemical pathways using a single method. In this way, we were able to detect biochemical pathways known to be perturbed in metabolic disease. A key tenet of this kind of analysis is that biomedical scientists are guided in the subsequent investigation of targets revealed by transcriptional profiling studies. Unfortunately, there is no unambiguous statistical test that allows investigators to be certain that any pathway highlighted is worthy of further study (and considerable expense). Permutation-based approaches are commonly used to determine the likelihood of an enrichment score being achieved by chance, and adjusting p-values by FDR can increase investigators’ confidence that a result is meaningful.

In summary, we suggest that providing researchers with a choice of analysis tools, informed by distinct rationales, will allow evidence to be combined or contrasted in order to facilitate more informed decision making.

Availability and requirements

Project name: PRS_software.

Project home page: http://www.buckingham.ac.uk/research/clore-laboratory-diabetes-obesity-and-metabolic-research/staff/maysson-al-haj-ibrahim/prs-tool/.

Operating system(s): Platform independent.

Programming language: MATLAB.

Other requirements: MATLAB R2010a or higher. If MATLAB is not installed on your PC, you need to install the MCR (MATLAB Compiler Runtime) environment first and then run the PRS tool.

Restrictions for use: None.

Authors' contributions

MAI conceived the method, generated the code and performed the testing. SJ provided guidance in the use of the DFS algorithm and assisted with statistical analysis. MAC provided invaluable insights during the development process. KL developed the algorithm in collaboration with MAI and assisted with the biological analysis. All authors were involved in preparing the manuscript, and all approved the final draft.

Additional files