Background

Time course gene expression experiments are widely used to study the dynamics of biological processes. Usually, the main goal of such experiments is to identify genes modulated along a biological process or after a system perturbation (such as drug treatments or genetic modifications). However, time course data are costly, and long time series usually have few or no replicates. In this context, a differentially expressed gene can be defined as a gene whose expression profile changes significantly along time and/or across multiple conditions. Several statistical models have been proposed to account for clusters and differential expression in the context of time series with [1-18] and without replicates [10, 19-21], but none of them was proposed in the context of pathway analysis. Pathway analysis has acquired great relevance in recent years, especially for its ability to increase the interpretability of gene expression results. Expression experiments typically provide lists of differentially expressed genes (DEGs) that represent the starting point for result interpretation. This step is not trivial and remains challenging for this type of analysis. Grouping genes into functionally related entities (such as pathways) is of great help in the interpretation of the results. Several methods have been proposed to this aim, based on very different statistical tests and null hypotheses [22, 23]. Broadly speaking, they can be divided into the classical enrichment analysis [24-28], working on gene lists selected through a gene-level test, and the novel global and multivariate approaches [29-37], which define a model for the whole gene set (see [22, 38-40] for comprehensive reviews and comparative analyses). The latter can be further divided into 'topological' and 'non-topological' methods according to their ability to gain power from the topology of the pathway [25, 35, 36, 41-43].

A pathway is a complex structure comprising chemical compounds mediating interactions and different types of gene groups (e.g. protein complexes or gene families) that are usually represented as single nodes but whose measures are not available using gene expression data. However, after appropriate biologically-driven conversion [44, 45], a biological pathway can be represented as a graph where genes and their interactions are, respectively, nodes and edges of the graph.

Taking advantage of the structure of the graph, Massa et al. [35] used Gaussian graphical model theory to test both differences in mean and in covariance matrices between two experimental conditions. In particular, graphical models are useful to decompose the overall graph (obtained from a pathway) into smaller components (cliques), that can be explored and tested in detail. Martini et al. [36] proposed an extension of this method, called CliPPER, based on a two-step empirical approach. In the first step, it selects pathways with covariance matrices and/or means significantly different between experimental conditions dealing with the p >> n case; in the second step, it identifies the sub-paths (called signal paths) most associated with the phenotype.

Pathway analysis is mainly tailored to two-group comparisons, and few efforts have been dedicated to the time course design. Here, we propose a modification of [36], called timeClip, to deal with long time course data without replicates. Specifically, timeClip combines principal component analysis, regression models and graph decomposition to explore temporal variations across and within pathways. Moreover, timeClip implements an easy and effective visualization of the dynamics of the pathways.

On simulated datasets, timeClip shows good performance in terms of power, specificity and sensitivity. Using real data on mouse muscle regeneration [46], we obtain results in agreement with the scientific literature.

Method

Pathway annotation

A critical step in topology-based pathway analysis is the availability and the quality of the pathway topology. Our group has recently developed graphite, a Bioconductor package for the storage, interpretation and conversion of pathway topology to gene-only networks [44]. graphite discriminates between different types of biological gene groups and propagates gene connections through chemical compounds. Specifically, protein complexes are expanded into a clique (all proteins connected to the others), while gene families are expanded without connections among their members; see [44, 45] for more details. The current version of the graphite Bioconductor package is limited to human, so here we build a dedicated graphite package for mouse KEGG pathways. This package is available at http://romualdi.bio.unipd.it/wp-uploads/2013/10/graphite.mmusculus_0.99.2.tar.gz.

timeClip: general approach

A pathway is composed of multiple genes, so to reduce the dimension of a whole pathway or of a portion of it, we used principal component analysis. The first principal component is then explored for temporal variation. A vast number of techniques exist for analyzing regularly sampled time series. Unfortunately, irregular sampling (a common practice in biology) makes direct use of such estimation techniques impossible. The most common approach for irregularly sampled time series transforms unevenly spaced data into equally spaced observations using some form of interpolation, which introduces well-known biases. To avoid them, here we propose to use a regression model combining a polynomial trend and a continuous-time Gaussian autoregressive process of order 1 (AR(1)). timeClip then follows the two-step approach of CliPPER. In the first step, the whole pathway is explored for its temporal variation. If the pathway is declared time-dependent, in the second step timeClip decomposes the pathway into a junction tree and highlights the portion most dependent on time. A general schema of the approach is summarized in Figure 1.

Figure 1. timeClip. Global overview of the timeClip approach.

Step 1: exploring the whole pathway

Let $X_{n \times t}$ be the normalized, log-transformed gene expression matrix with genes on the rows and experiments (equal to the time points $t$) on the columns. Let $X^P_{p \times t}$ be the sub-matrix of the $p$ genes belonging to pathway $P$. On the transpose of $X^P$, $(X^P)'$, we perform principal component analysis (PCA). We used both the classical (R package stats) and the robust (R package rrcov) versions of PCA. Let $Z^P_{p \times t}$ be the scores matrix and $L^P_{p \times t}$ the loadings matrix. We call $Z_1^P, \ldots, Z_p^P$ the $p$ principal components. In this way, the first PCs summarize the temporal variation of the genes in pathway $P$ (if present). Thus, from now on we will write $Z_i^P$ as $Z_i^P(t)$. A similar approach was recently proposed by [15] (PCA-maSigFun), which uses principal component analysis to identify temporally homogeneous groups of genes within the pathway.
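As an illustration of the dimension-reduction step, the following is a minimal sketch of how the scores of the first principal component could be obtained from a pathway sub-matrix. It is not the implementation used in this work (which relies on the R packages stats and rrcov): the function name `first_pc_scores`, the use of power iteration, and the toy matrix in the usage note are assumptions made for the example, and only the classical (non-robust) PCA is sketched.

```python
import math

def first_pc_scores(X, n_iter=500):
    """Scores of the first principal component of a pathway sub-matrix.

    X is a list of p gene rows, each holding t expression values.
    PCA is run on the transpose (time points x genes), so one score
    per time point is returned. Hypothetical helper, classical PCA only.
    """
    p, t = len(X), len(X[0])
    # Transpose to time points x genes and centre every gene.
    means = [sum(row) / t for row in X]
    Xc = [[X[g][j] - means[g] for g in range(p)] for j in range(t)]
    # Power iteration on Xc'Xc to approximate the first loading vector.
    # Note: the uniform start fails if it is orthogonal to that vector.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        s = [sum(Xc[j][g] * v[g] for g in range(p)) for j in range(t)]
        w = [sum(Xc[j][g] * s[j] for j in range(t)) for g in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance left to capture
        v = [x / norm for x in w]
    # Project the centred data on the loading vector.
    return [sum(Xc[j][g] * v[g] for g in range(p)) for j in range(t)]
```

For a toy pathway such as `[[0, 1, 2, 3], [0, 2, 4, 6], [1, 1, 1, 1]]`, in which two genes increase with time and one is flat, the returned scores increase monotonically across the four time points, which is the kind of temporal signal the subsequent regression step is meant to detect.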

Then, we assume that the irregularly sampled signal $Z_i^P(t)$ can be decomposed as $Z(t) = p(t) + \epsilon(t)$, where $p(t)$ is a deterministic function, hereafter called "trend", and $\epsilon(t)$ is the realization of a stationary stochastic process with mean zero. Extensive exploratory analysis suggests that a reasonable choice for the trend component is a polynomial of degree 2 in $t$, i.e.,

$$p(t) = \beta_0 + \beta_1 t + \beta_2 t^2$$

with $\beta_1$ capturing existing temporal behaviors of $Z_1^P(t)$ and $\beta_2$ correcting for potential non-linearities.

Moreover, we assume that $\epsilon(t)$ follows a continuous-time Gaussian autoregressive process of order 1. The model is fitted using generalized least squares (as implemented in the nlme R package). The representative p-value of pathway $P$, $p_P$, is then taken to be the p-value of the test of nullity of $\beta_1$ (obtained by a t-test as implemented in the gls function of the nlme R package). Bonferroni correction is used to adjust p-values for multiple tests.
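To make the regression step concrete, the sketch below fits the quadratic trend by generalized least squares under a continuous-time AR(1) correlation, in which two observations taken at times $t_i$ and $t_j$ have correlation $\varphi^{|t_i - t_j|}$. It is a simplified illustration and not the nlme code used here: the correlation parameter `phi` is assumed to be known (gls estimates it from the data), and the helper names `solve` and `gls_quadratic_trend` are hypothetical.

```python
import math

def solve(A, B):
    """Solve A X = B (B given as a list of rows) by Gauss-Jordan elimination."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[piv] = M[piv], M[c]
        d = M[c][c]
        M[c] = [x / d for x in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [row[n:] for row in M]

def gls_quadratic_trend(times, z, phi):
    """GLS fit of z(t) = b0 + b1 t + b2 t^2 with CAR(1) errors.

    phi (0 < phi < 1) is ASSUMED known. Returns (beta, standard errors).
    """
    n = len(times)
    X = [[1.0, t, t * t] for t in times]
    # Continuous-time AR(1) correlation: phi ** |t_i - t_j|.
    V = [[phi ** abs(ti - tj) for tj in times] for ti in times]
    W = solve(V, [X[i] + [z[i]] for i in range(n)])  # V^{-1} [X | z]
    A = [[sum(X[i][a] * W[i][b] for i in range(n)) for b in range(3)]
         for a in range(3)]                           # X' V^{-1} X
    rhs = [[sum(X[i][a] * W[i][3] for i in range(n))]
           + [float(a == b) for b in range(3)] for a in range(3)]
    sol = solve(A, rhs)
    beta = [row[0] for row in sol]
    cov = [row[1:] for row in sol]                    # (X' V^{-1} X)^{-1}
    resid = [z[i] - sum(X[i][a] * beta[a] for a in range(3)) for i in range(n)]
    Vr = solve(V, [[r] for r in resid])
    sigma2 = max(sum(resid[i] * Vr[i][0] for i in range(n)), 0.0) / (n - 3)
    se = [math.sqrt(sigma2 * cov[a][a]) for a in range(3)]
    return beta, se
```

With noisy data, the statistic `beta[1] / se[1]` would then be compared with a Student t distribution on $n - 3$ degrees of freedom to test the nullity of $\beta_1$; the p-value computation is left out of the sketch to keep it dependency-free.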

We evaluated the possibility of fitting a polynomial regression not only on the first PC, but also on a few additional $Z_i^P$, with $i = 2, 3$. However, we did not find significant improvements in the final list of significant time-dependent pathways (data not shown).

Step 2: decomposing the pathway

Pathways declared as time-dependent in step 1 are then moralized, triangulated and decomposed into a junction tree as described in [36].

Briefly, moralization inserts an undirected edge between two nodes that have a child in common and then eliminates directions on the edges; triangulation inserts edges in the moralized graph so that all cycles of size ≥ 4 have chords, where a chord is defined as an edge connecting two non-adjacent nodes of a cycle. A clique in the triangulated graph is a complete subgraph, i.e. one in which every pair of vertices is joined by an edge, while a junction tree is a hyper-tree having cliques as nodes and satisfying the running intersection property, according to which, for any cliques $C_1$ and $C_2$ in the tree, every clique on the path connecting $C_1$ and $C_2$ contains $C_1 \cap C_2$ [36, 47]. For a given graph there could be more than one junction tree. Here we force the root of the junction tree to be in agreement with the structure of the pathway.
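The moralization step can be sketched in a few lines. The example below is only illustrative: the function name `moralize` and the representation of the directed graph as a child-to-parents dictionary are assumptions, and triangulation and junction tree construction are not covered.

```python
from itertools import combinations

def moralize(parents):
    """Moralize a directed graph.

    parents maps each child node to the list of its parent nodes.
    Returns the undirected edge set, each edge being a frozenset of
    two node names.
    """
    edges = set()
    for child, pars in parents.items():
        for p in pars:
            if p != child:
                edges.add(frozenset((p, child)))   # drop the direction
        for a, b in combinations(sorted(set(pars)), 2):
            edges.add(frozenset((a, b)))           # join parents of a common child
    return edges
```

For instance, with `{"C": ["A", "B"], "D": ["C"]}` the nodes A and B share the child C, so the moral graph contains the extra edge A-B in addition to A-C, B-C and C-D.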

A clique $k$ of pathway $P$, denoted $C_k^P$ (with $k = 1, \ldots, K$), is composed of a subset of the genes in $P$, $c_k^P$. Let $X_{c_k^P}$ be the sub-matrix of $X$ corresponding to the genes of clique $C_k^P$. For each clique $k$ of $P$ we apply the same approach as described in step 1: PCA transformation and then a linear model with polynomial trend and autoregressive process of order 1 on the first PCs. The p-value of clique $k$ in pathway $P$, $p_{C_k^P}$, is given by the p-value of $\beta_1$ in the polynomial regression. Finally, the best time-dependent paths within a pathway $P$, hereafter called $S_j^P$, $j = 1, \ldots, J$, are identified using the relevance measure as described in [36]. Briefly, a path is a chain of consecutive time-dependent cliques ($p_{C_k^P} \le 0.05$) with gaps of size at most one. Then, for each path in the pathway a cumulative score is calculated along the path: the lower the p-value of a clique in the path, the higher its contribution to the score; in case of a gap, the score is penalized. The final score of a path is the maximum value reached by the score along the path. Then, the score is normalized by the path length; this quantity is called relevance [36].
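The chaining rule (consecutive time-dependent cliques, with gaps of at most one clique) can be illustrated as follows. This is a sketch under stated assumptions: cliques are taken in order along a single branch of the junction tree, the threshold is 0.05, a single significant clique is treated as a path of length one, and the function name `time_dependent_paths` is hypothetical. The relevance score itself is defined in [36] and is not reproduced here.

```python
def time_dependent_paths(pvalues, alpha=0.05):
    """Chains of significant cliques along one branch of a junction tree.

    pvalues lists the clique p-values in branch order. A chain may step
    over at most one non-significant clique at a time. Returns a list of
    (start, end) index pairs, both inclusive.
    """
    sig = [p <= alpha for p in pvalues]
    paths, i, n = [], 0, len(sig)
    while i < n:
        if not sig[i]:
            i += 1
            continue
        start = end = i
        j = i + 1
        while j < n:
            if sig[j]:
                end = j            # extend the chain
                j += 1
            elif j + 1 < n and sig[j + 1]:
                end = j + 1        # bridge a gap of size one
                j += 2
            else:
                break              # gap longer than one: the chain ends
        paths.append((start, end))
        i = end + 1
    return paths
```

With p-values `[0.01, 0.2, 0.03, 0.5, 0.6, 0.04]`, the first three cliques form one path (the second clique is a tolerated gap), the two consecutive non-significant cliques break the chain, and the last clique forms a separate path.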

As final results, for each time-dependent pathway, we report a list of relevant paths, ranked according to their relevance. Currently, step 2 is the most innovative feature of timeClip and, as far as we know, there are no existing tools using a similar strategy.

Simulated data

As some paths may be declared time-dependent by timeClip step 2 simply as a consequence of type I errors in timeClip step 1, we used a simulation to evaluate the percentage of false positives under the null hypothesis and to estimate the statistical power in different scenarios.

False positive rate estimation

Given a pathway $P$ and its graph structure ($G$), for 1,000 runs we randomly generate a gene expression matrix $X_{n \times t}$ from a multivariate normal distribution with zero mean and covariance matrix $\Sigma$, with $\Sigma \in S^+(G)$ (where $S^+(G)$ is the set of symmetric positive definite matrices with null elements corresponding to the missing edges of $G$). In this case, gene expression profiles are time-independent. Then, for each run we calculate $p_P$ (for both irregularly and regularly sampled time points; see Section Step 1: exploring the whole pathway). Under this scenario, at the nominal level α = 0.05 we expect a proportion of rejections around 5%. We repeat the simulation for different values of $n$ ($n$ = 5, 10, 15, 20, 25, 30) and $t$ ($t$ = 5, 10, 15, 20, 30).

Power estimation

To verify that the model is able to identify time-dependency arising from different generating mechanisms, we simulate data using polynomial models, autoregressive models of order 1 and a combination of both (polynomial models with autocorrelated errors). Then, the power is estimated for irregularly and regularly sampled time points.

Given a pathway $P$ and its graph structure ($G$), for 1,000 runs we randomly generate a gene expression matrix $X_{(n-s) \times t}$ from a multivariate normal distribution with zero mean and covariance matrix $\Sigma$, with $\Sigma \in S^+(G)$. Then, the expression profiles of the remaining $s$ genes, with $s \le n$, are simulated to have different degrees of time-dependency. Specifically, we use polynomial models (Equation 1), autoregressive models of order 1 (Equation 2, where $\epsilon^*$ is white noise) and the combination of both (Equation 3, where the error term $\epsilon_s$ is an AR(1) process).

$$x_s(t) = \alpha_0 + \alpha_1 t + \alpha_2 t^2 + \epsilon^* \qquad (1)$$

$$x_s(t) = \varphi_0 + \varphi_1 x_s(t-1) + \epsilon^* \qquad (2)$$

$$x_s(t) = \alpha_0 + \alpha_1 t + \alpha_2 t^2 + \varphi_1 \epsilon_s(t-1) \qquad (3)$$

The coefficients $\alpha_*$ are independently generated from a $U(-5, 5)$ distribution, and the $\varphi_i$ are generated so as to achieve stationarity. In this way, we simulate expression profiles with different degrees of temporal variation. Then, for each run we calculate $p_P$ (see Section Step 1: exploring the whole pathway). Under this scenario, the proportion of rejections estimates the statistical power. We repeat the simulation for different combinations of $\varphi$, $n$, $s$ and $t$.
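The three generating models can be sketched as below. This is an illustration rather than the simulation code used for the reported results: the function names, the Gaussian white noise with standard deviation `sigma`, and the starting value `x0` are assumptions. In particular, Equation 3 is interpreted here as a polynomial trend plus an error term that follows an AR(1) recursion, in line with the description "polynomial models with autocorrelated errors".

```python
import random

def polynomial_profile(t_points, alpha, sigma, rng):
    """Equation 1: quadratic trend plus white noise."""
    return [alpha[0] + alpha[1] * t + alpha[2] * t * t + rng.gauss(0, sigma)
            for t in t_points]

def ar1_profile(n_points, phi0, phi1, sigma, rng, x0=0.0):
    """Equation 2: AR(1) process; |phi1| < 1 gives stationarity."""
    profile, prev = [], x0
    for _ in range(n_points):
        prev = phi0 + phi1 * prev + rng.gauss(0, sigma)
        profile.append(prev)
    return profile

def polynomial_ar1_profile(t_points, alpha, phi1, sigma, rng):
    """Equation 3 (as interpreted here): quadratic trend plus AR(1) errors."""
    profile, err = [], 0.0
    for t in t_points:
        err = phi1 * err + rng.gauss(0, sigma)   # autocorrelated error
        profile.append(alpha[0] + alpha[1] * t + alpha[2] * t * t + err)
    return profile

# Usage: coefficients drawn from U(-5, 5) as described in the text.
rng = random.Random(1)
alpha = [rng.uniform(-5, 5) for _ in range(3)]
example = polynomial_ar1_profile(range(10), alpha, phi1=0.5, sigma=1.0, rng=rng)
```

Setting `sigma` to zero removes the noise, which makes it easy to check that each function reproduces its deterministic part.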

Real data: muscle regeneration model

The benchmark dataset used [46] (GSE469) follows mouse muscle regeneration after intra-muscular injection of cardiotoxin. The regeneration process is followed for 27 unevenly spaced time points with only two technical replicates for each time point. Expression data were produced using single-channel Affymetrix microarrays. The probes in the platform were annotated with the EntrezGene custom CDF (version 14) [48], and data were normalized using robust multi-array analysis (rma) and quantile normalization. Then, technical replicates were averaged to get one measure for every time point.

Implementation and visualization: the wheel of time

timeClip is implemented as an R package available from the authors. The package can analyze equally and non-equally spaced time series according to the user setting. To get better insights into the temporal activation of the different portions of the pathway, we developed a new visualization using the Cytoscape software [49] and the RCytoscape Bioconductor package. The visualization, called the wheel of time, displays pie charts inside network nodes. For each pathway, timeClip exports to Cytoscape the structure of the junction tree, where each time-dependent clique has a pie chart that represents the time trend. Specifically, the pie is divided into as many slices as there are time points in the dataset. Each slice in the pie is colored (from green to red) according to the scores of the first principal component: the higher the value, the stronger the activation of a clique at a specific time point (red), and vice versa (green).
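The coloring of the slices can be illustrated with a short sketch. The exact color scale used by the package is not specified in the text, so the linear green-to-red interpolation, the min-max scaling within each clique and the function name `wheel_of_time_colors` are all assumptions made for the example.

```python
def wheel_of_time_colors(scores):
    """Map first-PC scores (one per time point) to hex colors.

    Assumed scale: the lowest score is pure green, the highest pure red,
    with linear interpolation in between.
    """
    lo, hi = min(scores), max(scores)
    span = hi - lo
    colors = []
    for s in scores:
        # A flat profile gets the midpoint color for every slice.
        frac = 0.5 if span == 0 else (s - lo) / span
        red = round(255 * frac)
        green = 255 - red
        colors.append("#{:02x}{:02x}00".format(red, green))
    return colors
```

Each returned color corresponds to one slice of the pie, in time order, so a clique activated only at late time points would show green slices followed by red ones.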

Results and discussion

Many biological processes need to be followed and monitored along time. In these cases time course designs are ideal: the higher the number of time points, the finer the monitoring. However, long time courses are often characterized by few or no replicates. Here, we present timeClip, a two-step approach to perform topological pathway analysis for time course gene expression data, specifically tailored to long time series without replicates (Figure 1). In the first step, we select pathways that show time dependency. In the second step, the selected pathways are decomposed into cliques and the time-dependent portions are isolated. In the next sections, we show the performance of timeClip using simulated and real datasets.

Simulations results

Two simulation strategies have been considered. The first one was designed to estimate the number of false positives under the null hypothesis of no temporal variation, the second to estimate the statistical power (see the Method section for details).

Table 1 and Table S1 (Additional file 1) report the percentage of false positives obtained with different n and t for the irregularly and regularly sampled time points, respectively. The average false positive percentage for each t and n is always limited to ~4-5%, with the exception of small time series (t = 5) and equally spaced time points where it is slightly higher. Thus, we can conclude that, in general, for long time series we have an excellent control of type I error even with exceptionally low sample sizes.

Table 1 Simulation results - False positives rate with different pathway dimensions n and irregularly sampled time points t.

Table 2 and Table S2 (Additional file 1) report the number of true positives obtained with n = 30 and different t and s for equally and not-equally spaced time points respectively. Here, the genes with temporal variation are simulated using different models (if s is the number of time-dependent genes among the n of the pathway, we simulate s/3 with polynomial, s/3 with AR(1) and s/3 with the combination of both). As expected, the power increases with the increase of t and s: the longer the time course and the higher the number of time dependent genes s within the pathway, the higher the power.

Table 2 Simulation results - Power estimate in case of n = 30 and different time course length t and time dependent genes s.

Specifically, when the time course is short (t = 10-20) the maximum power reaches 60%, while with long time series (t = 30) the power is above 80%. Moreover, it is worth noting that increasing the number of time-dependent genes does not significantly affect the power level. The greater impact that the number of time points has on statistical power, compared with the number of time-dependent genes, can be explained by the presence of two steps in our strategy: i) a data reduction step (with PCA on genes within pathways) and ii) a model-fitting step of the reduced variables on time points. PCA is an efficient method to detect variance components in the data. Thus, even with a small number of time-dependent genes, the first PC is able to capture the time trend when present. On the other hand, once the trend is captured, the goodness of fit of the regression model increases with the number of time points. The use of robust PCA does not change the performance of the method substantially (data not shown).

Case study: muscle regeneration model

Step 1 results

In step 1 every pathway is explored for its temporal dependence. In the benchmark dataset, we have to deal with 27 not equally spaced times (14 of which are equally spaced).

Comparing step 1 results for equally and not equally spaced time points, we obtain an overlap of 70%. This high degree of overlap makes us confident about the reliability of our approach. We summarized the results in the heat map of Figure 2 (values reported in Additional file 2). The heat map is obtained using the scores of the first principal component of each time-dependent pathway. From the unsupervised cluster analysis, we can define 3 pathway groups characterized respectively by a 'very early', 'early-intermediate' and 'intermediate-late' activation. Pathways characterized by a very early activation, like 'Malaria' and 'African trypanosomiasis', reflect the early activation of the inflammation processes that clear injured fibers. These processes are carried out by macrophages, which have a central role in the 'Malaria' and 'African trypanosomiasis' pathways. Macrophages clean up injured fibers and release growth factors like vascular endothelial growth factor (VEGF) and hepatocyte growth factor (HGF) [50].

Figure 2. Heat map of pathway PCs. Heat map colored according to the expression of the first PCs from green to red. According to the color pattern, pathways are divided in early, early-intermediate and late-intermediate. Time is measured in hours after treatment.

In the early-intermediate pathway group, we can see the effects of the early signal secretion: in fact, the group contains pathways like 'mTOR signaling pathway', 'VEGF signaling pathway', 'Insulin signaling pathway' and other metabolic pathways like 'Ether lipid metabolism' and 'Citrate cycle (TCA cycle)'. Globally, these pathways indicate that the regeneration progress has begun.

'mTOR signaling pathway', probably the most important pathway in muscle regeneration, on one side sustains VEGF signaling and on the other promotes the protein production needed for clonal expansion of the myoblasts, their growth and fusion. In particular, mTOR integrates growth factor signaling with a variety of signals from nutrients (amino acid metabolism activates the mTOR pathway) and cellular energy status [51]. The energy status of the cell is indeed monitored by the pathways involved in energy metabolism, like 'Carbohydrate digestion and absorption', 'Citrate cycle (TCA cycle)' and 'Fatty acid metabolism'. These processes are very important in regeneration; in fact, it was demonstrated that glycolytic metabolism is restored three days after myofibril formation [52].

The intermediate-late activation group mainly contains pathways involved in inflammatory responses, like 'B and T Cell receptor signaling pathway', 'Toll-like receptor signaling pathway', 'Adipocytokine signaling pathway' and 'Leukocyte transendothelial migration'. Recent discoveries reveal complex interactions between skeletal muscle and the immune system that regulate all phases of muscle regeneration [50]. Moreover, this pathway group includes the 'Axon guidance' and 'Dopaminergic synapse' pathways, which are involved in nervous impulse transduction. We can speculate that at the end of the regenerative process the nervous system can contact the restored contractile cells to ensure and maintain their functionality.

This group also contains pathways involved in signal transduction, like 'HIF-1 signaling pathway'. HIF-1 has recently been demonstrated to be essential for skeletal muscle regeneration in mice [53]. In fact, this pathway manages a plethora of signals and interfaces with pathways like mTOR signaling pathway, PI3K-Akt signaling pathway, MAPK signaling pathway, Citrate cycle (TCA cycle), Calcium signaling pathway, VEGF signaling pathway and Ubiquitin mediated proteolysis. Together with all these pathways, 'HIF-1 signaling pathway' finely tunes the balance of oxygen consumption.

In step 1, we are able to see only the strongest signals, and the pathway name alone does not always reflect the activity of the pathway. To tackle the complexity of the pathway, timeClip step 2 investigates in depth the timing of activation of the different portions of the pathway.

Step 2 results

In the second step, we focused on the Akt-mammalian target of rapamycin (mTOR) signaling pathway. It regulates a plethora of signals (cell growth, VEGF signaling pathway, autophagy), and its action is related to other pathways known to be involved in muscle regeneration, like Insulin signaling pathway and MAPK signaling pathway [54].

The junction tree of mTOR signaling pathway (Figure 3A) starts with Igf1 (Insulin-like growth factor 1) as represented in the KEGG map (Figure 3B). Within mTOR signaling pathway we identified a total of 6 paths, ranked by their relevance score (Table 3).

Figure 3. Activation of the mTOR signaling pathway. Panel A. Junction tree of the mTOR signaling pathway (using graphite R package and database KEGG). The top ranked time-dependent paths identified in timeClip step 2 are highlighted using the wheel of time visualization. Panel B. KEGG representation of mTOR signaling pathway. Genes are colored according to the paths in panel A. Panel C. Enlargement of the wheels of time representative of the main block of mTOR signaling pathway: from t0 to t27 (clockwise) every slice of the pie is colored according to the value of the clique first PC (green means no activation; red means activation).

Table 3 mTOR signaling pathway: relevant paths identified by timeClip step 2

The most relevant of these paths goes from the 1st to the 21st clique and contains 16 cliques. The second and the third paths share a large portion with the first one. This portion goes from clique 1 to clique 13 (blue nodes on the junction tree, Figure 3A) and contains genes like Igf1, Insulin, Mapk3, Mtor and Akt, which together represent the backbone of the pathway, where the starting activating signal is regulated by Igf1. Then Pi3k, Mapk and Akt translate the signal and activate Mtor, which organizes the effectors. From the junction tree we can identify three different terminal effectors: the first, in pink, is the portion that leads to the VEGF signaling pathway. The second, in purple, is the regulation of autophagy, and the third, in yellow, is the regulation of protein synthesis, which is necessary for the recovery of skeletal muscle mass during regeneration [55]. In panel C of Figure 3, we summarize the timing of the 'mTOR signaling pathway' activation. With the wheel of time, we can see that the pathway backbone is activated in the early phases. The portion that leads to the VEGF signaling pathway is activated in the late phases. The effectors that lead to autophagy are switched off at the end of the regenerative process, while the activation of protein synthesis begins in the early-intermediate phases and lasts until the end of the process.

As discussed before, the involvement of HIF-1 in the skeletal muscle regeneration process was recently demonstrated [53]. We observed that the most relevant path of HIF-1 signaling pathway is 37 cliques long, underlining its importance in this process. This path is activated by different growth factors (Igf, Ins, Egf), and signals are translated through Akt and mTOR towards HIF-1α/β. HIF-1α regulates many processes, from the oxygen balance to apoptosis (see Additional file 3). Such downstream effectors confirm its importance in skeletal muscle regeneration, in accordance with the results of [53].

Comparison with other methods

In this section we compare timeClip step 1 results with the methods proposed by [15]. Step 2, which is the most innovative feature of timeClip, cannot be compared to any existing tool. [15] proposed two different strategies. The first one, called maSigFun, considers individual genes as different observations of the expression profile of the pathway. The second approach, PCA-maSigFun, uses PCA to identify groups of genes showing different time-dependencies. maSigFun did not give any significant time-dependent pathway using our dataset describing skeletal muscle regeneration (p ≤ 0.05), while PCA-maSigFun returned 59 significant KEGG pathways (p ≤ 0.05). 26 out of 59 (44%) pathways are in common with timeClip step 1 results. Both methods retrieve mTOR signaling pathway; however, PCA-maSigFun did not call HIF-1 signaling pathway as significant, although it seems to be closely related to muscle regeneration [53]. Many of the PCA-maSigFun-specific pathways (15 out of 33) refer to metabolic processes like Inositol phosphate metabolism, Pyruvate metabolism, Tyrosine metabolism and Glycerolipid metabolism. The remaining pathways are highly heterogeneous and comprise Acute myeloid leukemia, Bladder cancer, Melanoma and Pancreatic cancer.

Conclusions

Pathway analysis is a useful and widely used statistical approach to test groups of genes between two or more biological conditions. Although many efforts have been dedicated to implementing novel gene set analyses in multivariate and topological contexts, few of them deal with time course experiments. Time course experiments are used to monitor the dynamics of biological processes under physiological conditions or after perturbations.

In this context there is a clear trade-off between the number of time points and the number of replicates. In general, if the goal of the study is the identification of time-dependency, long time courses are required at the expense of replicates; on the other hand, if the goal is the characterization of a short-term response, a large number of replicates for each time point is required to increase statistical power. There are few long time series datasets, and in our opinion this is partly due to the experimental costs but also to the lack of effective methods to study and interpret the results. Here, we present timeClip, an empirical two-step approach specifically tailored to long time course gene expression data without replicates. Using simulated data, timeClip shows good performance in terms of type I error control and power. Furthermore, we successfully identify most of the key pathways involved in the early, middle and late phases of the skeletal muscle regeneration process. A visualization tool has also been implemented to explore the dynamics of the transcriptome.