Background

The sequencing technologies have provided major advances in the understanding of biological mechanisms. Particularly, within these sequencing technologies, the RNA-Seq has contributed to understanding gene expression, changing our view of the transcriptome [1, 2]. The identification of differentially expressed genes, new transcripts, expressed mutations, among others, has allowed a better understanding of human diseases. New biomarkers or therapeutic targets against diseases such as cancer have been proposed using this technology [3].

However, there is no standard pipeline for the analysis of RNA-Seq data. In fact, each step of the analysis admits many options. The reads can be aligned (or mapped) using different tools. Some widely used aligners are STAR [4], Tophat [5] or Bowtie [6]. Then, the matrix of counts is obtained, i.e the estimation of RNA abundance (cDNA) by the number of aligned read over a gene or isoform. These counts can be obtained using software like HTSeq [7] or featureCounts function of the Rsubread package [8]. The differential expression analysis can be done using the widely used edgeR [9], DESeq [10], among others. Besides, the RNA-Seq data results can be different for each pipeline and it is not established which is the best analysis protocol [11].

There are (and will be) many challenges to solve in mapping, read count and statistical analysis. In this sense, the major problem in statistical analysis of RNA-Seq, and in all omics data, is the small sample size with respect to the large number of variables (genes, isoforms, exon, …). It is not rare that just a few samples determine the results i.e. a great variation accounted by a few observations. Additionally, there exists important confounding variables in the differential expression analysis. They are the library size, the gene length and others [11, 12]. It is not rare that a first differential expression analysis provides several candidate genes that are not significant in a posterior experimental validation. Thereby, the RNA-Seq do not provide accurate absolute measurements [12]. In order to solved it, new methods for RNA-Seq data analysis have been developed [13, 14].

In this paper, we propose a new method for the differential expression RNA-Seq analysis with paired design. Our approach proposes to compare the counts within each pair by taking into account library sizes [15]. The p-values for all pairs corresponding to a given gene are aggregated using ordered weighted averages [16]. This aggregated value will quantify the phenotype-expression association from the gene expression profile. These values are used to test differential expression using randomization distributions. Our approach is compared with edgeR [9] and DESeq2 [17] methods for paired samples.

The methodology have been tested using a 68 pairs data set from patients with colorectal cancer. Of these, 50 are obtained from The Cancer Genome Atlas (TCGA) [18] and 18 from PRJNA218851 BioProject [19, 20].

Each pair is composed with a sample from solid tumor and adjacent normal tissue from the same individual. The new methodology has been implemented in the R package OMICfpp and is available at http://www.uv.es/ayala/software/OMICfpp_0.2.tar.gz.

Methods

Data

A colorectal cancer paired data set of 50 patients (tumor and normal adjacent tissue) were downloaded from TCGA [18] using gdc-client tool. In addition, a colorectal cancer data set of 18 pairs of samples were downloaded from SRA, PRJNA218851 BioProject [19] using the SRA toolkit [20]. The quality control of the PRJNA218851 raw dataset was checked using the FASTQC tool and low quality reads were discarded using fastx-toolkit (http://hannonlab.cshl.edu/fastx_toolkit/). Later, the reads were mapped using STAR with the GRCh38 human genome as the reference one. Then the SAM files were converted to sorted BAM files using Samtools [21]. Finally, the count matrix was generated using the summarizeOverlaps function of GenomicAlignments R package [22]. At this point, we have the counts of both data sets (PRJNA218851 and TCGA), so they are included in a single matrix using SummarizedExperiment R package [23]. A detailed description can be found in the Additional file 1: Methods.

OMICfpp methodology

The major problem in statistical analysis of omics data is the small sample size with respect to the large number of variables (genes, exons, locii, …). From an statistical point of view we are dealing with counts and covariables describing the samples i.e. a count response model is the suitable approach. These models are part of the generalized linear models and should be the natural approach. However, the small sample sizes do make it more difficult to apply such kind of models. In this paper, we propose a method for RNA-Seq data in paired designs where we tackle the issue of small sample.

In our approach, a p-value for each case-control pair is obtained, using a binomial test. These p-values are aggregated using an ordered weighted average (OWA) with a given orness previously chosen by the user or using the chooseOrness function (from the package OMICfpp) for the automatic orness choice. The aggregated p-value from the original data is compared with the aggregated p-value obtained using the same method applied to random pairs. These new pairs are generated using a randomization distribution (“Randomization distributions” section). This randomization p-value is used as a raw p-value to test the differential expression of each gene (“Marginal gene analysis” section). Figure 1a displays the outline of our approach. A detailed software implementation is contained in Additional file 1: Methods.

Fig. 1
figure 1

OMICfpp method. a) Workflow used by OMICfpp paired data analysis. The 68 paired RNA-Seq data from TCGA and PRJNA218851 BioProject were analyzed by our proposed method OMICfpp and by conventional methods edgeR and DESEq2. In the OMICfpp approach, original and randomized p-values are obtained for each paired data, applying different randomization distributions. The p-values must be aggregated using the OWA to obtain a single value per gene. The user decides, by choosing an orness, the weights assigned to the genes. Finally, a marginal gene analysis is performed and a list of genes ordered by importance according to the assigned weights is obtained. These results are compared with those obtained using edgeR. b) IL11, c) HIST2H3C and d) AC012414.3 are examples of genes with area under the cumulative distribution function, respectively. Top-left, the kernel density estimator corresponding to the original p-values of the binomial test; top-right: the corresponding cumulative distribution function of these original p-values; bottom-left: the between-pair p-values corresponding to all the values of orness used in the study; bottom-right, the complete p-values corresponding to all orness. e) Optimal orness by comparing n0 extreme genes. f) Proportion of significant genes for different α values obtained using the complete distribution. g) Density function used in “Results” section to calculate the score of Eq. 3

Randomization distributions

The data are paired samples. It will be denoted as (yi1,yi2) the i-th pair of counts for a given gene. The whole expression profile would be (yi1,yi2) with i=1,…,n with 2n samples and N genes. We are going to consider different randomization distributions.

  • Between-pairs. The first element of each pair is maintained as the original one. The second element of each pair is obtained permuting the second components of all pairs between them. We have (yi,1,yγ(i),2) for i=1,…,n where γ is now a permutation of (1,…,n). The number of possible permutations is n!.

  • Complete. Let us choose I={i1,…,in} a random subset of {1,…,2n}. The indices of {1,…,2n} not in {i1,…,in} can be denoted J={j1,…,jn}. A random correspondence between I and J will produce the pairs. Cases can be considered controls and the pairs are randomly assigned too. The number of possible values is \(\frac {(2n)!}{n!}\).

From now on, they will be named between-pair and complete distributions. Let (y1,y2) be a pair of counts to be compared and (m1,m2) the corresponding library sizes. A simple approach to compare the counts by taking into account the library sizes was proposed in [15]. In fact, assuming given the total number of counts per gene and the library sizes, we can test the null hypothesis Hi:pi1=m1/(m1+m2) against Hi:pi1m1/(m1+m2) where pi1 is the proportion of the i-th gene in the first sample. Under the null hypothesis, the statistic Yi1 follows a binomial distribution with Yi1+Yi2 trials and the success probability m1/(m1+m2). Note that the null distribution assume that the (random) value of Yi1+Yi2 is given.

Other testing procedures for this null hypothesis could be used and incorporated in our approach. For a given statistical test and for the i-th gene we will have \(\phantom {\dot {i}\!}(t_{i1},\ldots,t_{i_{n}})\) where tij is the statistic or p-value obtained in the j-th test. It is well known that a few pairs could produce extreme values of these statistics. The simplest approach could be to aggregate the values \(\phantom {\dot {i}\!}(t_{i1},\ldots,t_{i_{n}})\) using the mean or a median. In our opinion, a more general and really interesting point of view is to use ordered weighted averages (in short, OWA) [16].

Let us remember this aggregation operators. Let a=(a1,…,an) be the column vector of values aggregated and a is the transpose of the column vector a. Let \(\phantom {\dot {i}\!}\boldsymbol {a}_{r} = (a_{r_{1}},\ldots,a_{r_{n}})'\) be the ordered version a i.e. \(a_{r_{1}} \geq \ldots \geq a_{r_{n}}\). An ordered weighted average (OWA) operator of dimension n is a mapping \(f: {\mathbb R}^{n} \rightarrow {\mathbb R}\) with an associated weighting vector w=(w1,…,wn) such that \({\sum \nolimits }_{j=1}^{n} w_{j} =1\) and where \(f(a_{1},\ldots,a_{n})={\sum \nolimits }_{j=1}^{n} w_{j} a_{r_{j}} = \boldsymbol {w}' \boldsymbol {a}_{r}\). The particular cases shown in Table 1 can better illustrate the idea underlying OWA operators.

Table 1 OWA aggregation values using ascending order

In this paper we have used the weights proposed in [24]. The method uses, for an orness δ, the probability function of a binomial distribution with n−1 trials and success probability 1−δ: \(w_{i} = \binom {n-1}{i-1} (1-\delta)^{i-1} \delta ^{n-i}\) for i=1,…,n. No weight is associated with any particular input. The relative magnitude of the input decides which weight corresponds to each input. We have chosen this approach with the following problem in mind. A major problem with paired RNA-Seq counts is that just a single pair of samples is responsible for the global observed difference or global effect. The whole pair or just an element of the pair could be an outlier or a real observation. The OWA operator permit us to control the influence of a particular pair. Each pair is marginally evaluated and the obtained statistics (p-values) are aggregated by taking into account their ordered values.

The OWA operators are bounded by the maximum and minimum operator. Yager [16] introduced a measure called orness to characterize the degree to which the aggregation is like an or (max) operation:

$$ \text{orness}(\boldsymbol{w}) = \frac{1}{n-1}\sum\limits_{i=1}^{n} (n-i)w_{i}. $$
(1)

Note that orness((1,0,…,0))=1, orness((0,0,…,1))=0 and \(\text {orness} \left ((\frac {1}{n},\frac {1}{n}, \ldots,\frac {1}{n}) \right)= 0.5\).

Up to now the OWA has been presented using the usual decreasing ordering. If the original values are increasingly ordered then the interpretation change. In our experiment we will aggregate p-values and these p-values will be increasingly ordered per gene, from the most significant pair (lowest p-value) to the less significant pair (highest p-value). An orness near 1 corresponds to the minimum of the p-values and an orness near 0 corresponds with the maximum of the p-values. Thus, an orness close to one uses the most significant pairs and an orness close to zero will use the less significant pairs. So, when the orness goes from 0 to 1, we are going from the maximum to the minimum of the p-values.

Marginal gene analysis

The original pairs for a given gene are \(\left (y_{i1}^{(0)},y_{i2}^{(0)}\right)\) for i=1,…,n. First, we choose a given orness δ and calculate the weights w. Second, we choose a test to compare both counts, between-pair or complete. Third, we choose a randomization distribution and generates B realizations using it being \(\left (y^{(b)}_{i1},y^{(b)}_{i2}\right)\) (with i=1,…,n) the b-th realization generated. The statistics observed (for the n comparisons) corresponding to the b-th realization generated will be \(\boldsymbol {t}^{(b)} = \left (t_{1}^{(b)},\ldots,t_{n}^{(b)}\right)\) where b=0 corresponds with the original data. The corresponding p-values under the null hypothesis of no association with the phenotype would be \(\boldsymbol {p}^{(b)} = \left (p_{1}^{(b)},\ldots,p_{n}^{(b)}\right)\). Fourth, we aggregate the generated p-values using an ordered weighted average. The b-th aggregated value will be \(v_{b} = {\sum \nolimits }_{j=1}^{n} w_{j} p_{r_{j}}^{(b)} = \boldsymbol {w}' \boldsymbol {p}_{r}^{(b)}\).

Under the null distribution (any of them) the value v0 is like v1,…,vB and any possible ordering of the vector (v0,v1,…,vB) has the same probability. If a one-tail test is used where low values correspond to the alternative hypothesis then the randomization p-value is given by

$$ p = \frac{|\{b: b=1,\ldots,B; v_{b} < v_{0} \}|}{B}, $$
(2)

where |·| denotes the number of elements. This p-value measures how extreme is v0 with respect to the others vbs and depends on the δ-orness used and the randomization distribution chosen. From now on, it will be denoted pb(δ) and pc(δ) for the between and complete distributions and a δ orness.

The between-pair p-values are evaluating the pair (or sample) factor i.e. we are looking for if there is a pair effect. Different orness will permit us to focus over a certain number of pairs from the lowest to the highest significant pairs. We are going to comment some genes in order to understand the utility of these p-values. We think that their interest is not just to declare a gene as significant or non significant. They shows a wider evaluation of the differential expression of the gene with respect to the pair effect (possibly outlier pairs) when the between-pair distribution is used and the condition effect (control vs cases) when the complete distribution is evaluated.

We have chosen three genes of the data used in section “Results” corresponding to extreme cases. Figures 1b, c and d shows a simple graphical description of the different p-values used in our approach. The top-left plot shows a kernel density estimator of the raw p-values corresponding to the original pairs. The top-right plot shows the empirical cumulative distribution function of these raw p-values. The bottom-left (respectively bottom-right) plot shows the between-pair (respectively complete) p-values for the different values of orness.

The first gene, Fig. 1b is a significant one with low p-values for all pairs. No outlier pair i.e. no pair with a clearly different p-value with respect to the other pairs. This can be seen in the plot bottom-left where the pb is horizontal. The bottom-right shows the gene is considered as significant using any orness.

The second and third genes, Fig. 1c and d are non significant genes, for all orness in all samples Fig. 1c and for some orness Fig. 1d. The gene in Fig. 1d has the highest area under the cumulative distribution function of the original p-values. The mass probability is close to one. It is clear in the cumulative distribution function almost null along the whole unit interval.

Differential expression using edgeR and DESeq2

In order to compare our results with the most used methodologies for differential expression analysis, we analyzed the data using the Bioconductor packages edgeR [9] and DESeq2 [17]. The Additional file 1: Methods contains the code and further details in order to reproduce these studies.

Results

Choosing an orness

Many possible methods could be proposed for the orness choice. We have implemented the following procedure where no prior knowledge of the user is assumed. It is a non supervised method. First, an small number of simulations is performed and the randomization p-values per gene corresponding to a set of orness values are calculated. For instance, we can take ten simulations and orness from 0.01 to 0.99 with a grid of 50 points. For each orness, we evaluate the mean of the largest n0p-values and the mean of the lowest n0p-values. We choose the orness corresponding to the largest difference between them i.e. the orness where the significant and non significant genes are more clearly distinguished. This is evaluated for different n0 values and the orness with the greater difference is chosen. The evaluated n0 values have to be chosen in such a way that the two gene set are clearly contained in the significant and non significant gene sets respectively. It is implemented in the function chooseOrness of the OMICfpp package. The simulation study will use this function. Note that the idea is to choose the orness comparing clearly significant and non significant gene sets. It is convenient to have a previous estimation about the fractions of both kind of genes. It can be estimated by using the procedure proposed in [25] and implemented in the R package [26]. We have used it to choose n_0. The details are in Additional file 1: Methods.

For our data set, values for n0 from 100 to 10000 with an increment of 10 were chosen. The number of estimated non significant genes (using the method in [25]) gives us a number around 11000 genes, thus we explore up to 10000. For each n0 the optimal orness is calculated (Fig. 1e). It is clear that there are two clearly defined intervals of n0 with the same orness within the interval. This figure suggests two possible orness values: 0.37 and 0.93.

The closer the orness to 1, the more stringent is the selection of differentially expressed genes (Fig. 1f and g). Thus, only genes that are significant in most or all samples are reported. It is not always the case that a gene is differentially expressed in all patients, especially when the sample size increases. So, choosing values of orness in the range [0.8, 1] could be a excluding selection. On the other hand, choosing the range [0,0.2] is too permissive. This is illustrated in Fig. 1f, where the proportion of genes with complete p-values lesser than a α value (from 0.1 to 0.01) in each δ orness value are evaluated.

However, the orness could be chosen according to an expert judgment based on previous knowledge. First, a small set of genes with differential expression experimentally verified and a set of housekeeping genes i.e. genes with no differential expression, can be proposed. In this case, we are concerned with colorectal cancer (CRC) data set. Thus, information from the TCGA project, through the web server for cancer and normal gene expression profiling (GEPIA) [27] can be used to select a set of genes with validated differential expression in CRC and other set of housekeeping genes. For instance, the genes CDH3 [28], IL11 [29] or SLC11A1 [30] are experimentally validated as differentially expressed in CRC. Also, the genes HIST2H3C, ACTB or RPS23 do not present differential expression in TCGA, have a constitutive function and are not previously described association with CRC, thus can be used as housekeeping. We can replace the data driven procedure with a supervised selection of significant and non significant gene sets.

Finally, the user could choose the orness according with a type of strategy. For instance, a greedy choice could be to use orness close to one i.e. looking for the most significant pairs. Also, a conservative strategies can be choose an orness of 0.5 and a inclusive strategy would use values close to zero i.e. close to the maximum of the p-values using the less significant pairs.

OMICfpp results using an orness value

The between pairs distribution reports the difference between pairs allowing us to identify the influence of outlier pairs. On the other hand, the complete distribution allows us reporting the differences between the controls and cases i.e. the evaluation of the experimental condition. Our methodology allows the evaluation of both experimental factors, although the condition (colorectal cancer in our experimental study) will be evaluated using the complete distribution.

The randomization p-values have been estimated using 1000 realizations. Two thousand eight hundred ninety seven genes were differentially expressed using an orness value of 0.37 and 1564 with an orness of 0.93 (p-value < 0.001, see Additional file 1: Results). Of these, 501 genes were reported in common. We pretend to order the genes using the p-values. Obviously, if we have such a large number of null p-values, they can non ordered using only this p-value. Then, we have used a second ordering criteria using the score proposed.

The top 30 genes for both orness value are shown in Table 2 and a bibliographic search was conducted in order to determine if the genes of each list were previously reported and validated. It has been found that, using an orness value of 0.37, 46,67% and 70% of the genes were previously reported in colorectal cancer and in another type of cancer, respectively. In addition, the bold entries show that 56.6% of the first 30 genes were also reported by other methods.

Table 2 The top 30 genes with differential expression reported by the OMICfpp method using an 0.37 and 0.93 orness value, respectively, with the complete distribution

Figure 2a displays the randomization p-value observed for all the orness values. Although, not all genes have a consistent low p-value pattern in a wide range of orness values, all of them show a null p-value around the 0.37 orness. For instance, genes such as ITGAL, IBSP and GUCA2B have a consistent low p-value pattern, Fig. 2a, and its differential expression in colorectal cancer were verified in previous studies (Table 2). Moreover, some genes that have less clearly defined profiles as MAGEC2, Fig. 2a, have also been experimentally validated (Table 2). However, according to our results, it is not differentially expressed in all patients, which is confirmed in the bibliography [31]. This demonstrates the utility of randomized p-value profiles for target gene selection. Thus, we can suggest, that the same result can be occur by the JADE2, ANO2 or MCF2L2 genes, that have not been previously reported.

Fig. 2
figure 2

Randomization p-values for the 30 most significant genes using an orness value of (a) 0.37, (b) 0.93 and (c) the interval score

The results obtained using an orness of 0.93, show that 43.3% and 73.3% of the genes were previously reported in CRC and in another type of cancer, respectively (Table 2). In addition, the bold entries show that only 26.67% of the first 30 genes were also reported by other methods. The genes ANOS1, CNTN1 or ARNTL2, with a well defined randomization p-values pattern and the genes MAD1L1 or ABCC2, with less clearly defined profiles (Fig. 2b), have been experimentally validated in CRC (Table 2). Also, a considerable number of the first 30 genes (33.3%) reported using an orness of 0.93, were previously experimentally validated.

In view of all the above, we suggest that the genes reported as differentially expressed using OMICfpp, which have not been previously reported in the bibliography, have a high probability of being validated experimentally. Especially those genes that present a defined profile in the randomized p-values pattern graphic and, to a lesser extent, those in which the randomized p-values pattern is less defined.

Ordering genes

We have selected in “Choosing an orness” section just two orness values according with an unsupervised method. However, it seems very interesting to explore the results using not just one or two orness values. Instead, we can use many orness values in order to sort the genes in the study. Note that for a given δ-orness the value pc(δ), randomization p-value using the complete distribution and a δ-orness, could be interpreted as the membership degree (in fuzzy set terminology) of this gene to be non significant i.e. to belong to the set of “non significant genes”. A high pc(δ) corresponds to non significant gene. The integral of this p-value with respect to the orness is a good quantification to be used to order the genes from most significant (lowest area) to lowest significant gene (highest area). This area is given by

$$ A = \int_{0}^{1} p_{c}(\delta) \psi(\delta) d\delta, $$
(3)

where ψ is a density function over the unit interval [0,1]. This aggregated value is like a mean membership degree of the gene to “non significant genes”. This value is calculated for all genes and ordered in increasing order from the most significant to the lowest significant gene. The ordering obtained is consistent with the results in the next section and the whole list can be found in the file score_complete.html in Additional file 1: Results. Note that we use only the complete p-value because we are interested in the differential expression. The ordering using the between-pair distribution would order the genes according with the importance of particular pairs to the differential expression.

For our data set the density used for the orness is a beta distribution with parameters (4.310396,1.977092) shown in Fig. 1g. The automatic procedure suggested to use two possible orness values, 0.37 and 0.93. A common criteria is to use an orness close to 0.5 i.e. close to the average. Following this idea, we have chosen a beta distribution giving a probability 0.9 to the interval [.37,.93], a probability 0.05 to the interval [0,0.37] and a probability of 0.05 to the interval [0.93,1]. The probability mass is mainly concentrated in the central interval (0.9) y the two other intervals concentrate an small probability (0.05 each one).

We identify a total of 26 genes with p-value < 1.05e-12. The first 30 genes reported are shown in the Table 3. The results indicated that 83.3% of the first 30 genes reported are also reported using edgeR and DESeq2 methods. In addition, only a 30% of the genes have been reported in the CRC bibliography. The randomization p-values profiles for the genes of the Table 3, are shown in the Fig. 2c.

Table 3 The top 30 genes with differential expression reported by the OMICfpp method using the interval score evaluated using the complete distribution

OMICfpp in a small sample size context

This section contains a small study about the sample size. In our case the sample size refers to the number of pairs used in the study. The original data used in the paper has 68 pairs. We have reproduced the study with two random samples from the original pairs. Firstly we have used 10 pairs and, secondly, a total number of 20 pairs. The results obtained with 10, 20 and 68 pairs will be compared (Fig. 3). We have considered two α values and evaluated the number of genes with a p-value lesser than α for each orness value. In fact, we have plotted the fraction of common significant genes with respect to the total number of genes in the study. These α values show two typical behaviour in these plots. Figure 3a corresponds with α=0.001. Shows a great overlapping between the results for 68 and 20 pairs for small values of orness. The number of these genes decreases for higher values of orness. Similar comment can be applied to the comparison of 68 with 10 pairs.

Fig. 3
figure 3

(a) Number of genes declared significant i.e. complete p-value lesser than α=0.001 using 68, 20 and 10 samples. Label n68.n20 corresponds with the proportion of genes declared significant using 68 and 20 for different orness values. Labels n68.n10 and n20.n10 corresponds with the proportion declared significant using 68 and 10 samples and 20 and 10 samples. (b) The corresponding proportions with respect to the number of significant genes using 68 pairs

As it could be expected when α is greater the number of common genes between the three studies is clearly greater. Figure 3b corresponds with α=0.001. The power of the study with 68 is much greater and only when we declare significant genes with a higher threshold the results are more similar.

Comparing OMICfpp, edgeR and DESeq2

We have compared our results with those obtained using the methods edgeR and DESeq2. We had four different methods per gene and four p-values for them. The two first will be the complete randomization p-values corresponding to the orness values 0.37 and 0.93. The third p-value corresponds to the method implemented in the package edgeR [9] and the fourth corresponds to DESeq2 [17].

In order to compare the significant genes by taking into account the four criteria an α value of 0.001 have been chose. The significant genes for a given p-value is composed by those genes with the p-value < 0.001.

Under our analysis, edgeR reported 15860 significant genes and DESeq2 reported 15563 genes. Of these, 13589 genes are reported by both methods and 86.5% of these are not reported by OMICfpp (Fig. 4a). OMICfpp method reported 2897 and 1564 genes using an orness value of 0.37 and 0.93, respectively.

Fig. 4
figure 4

Comparison between OMICfpp, edgeR and DESeq2 results. a) Venn diagram comparing the genes with raw p-values less than 0.001, using a OMICfpp and the p-value obtained by edgeR and DESeq2. b) Simulated data set using Poisson distributions. Differences between p-values using different methods, different types of genes and all orness. Rows correspond to the comparisons between methods: bc, between-pair vs complete distributions; be, between-pair distribution vs edgeR method; ce, complete distribution vs edgeR methods. c) Simulated data set using Negative binomial distributions. Differences between p-values using different methods, different types of genes and all orness. Rows correspond to the comparisons between methods: bc, between-pair vs complete distributions; be, between-pair distribution vs edgeR method; ce, complete distribution vs edgeR methods

OMICfpp reports around 85% fewer genes if a p-value < 0.001 is considered. The same applies if an adjusted p-value < 0.001 for edgeR (14332 significant genes) and DESeq2 (14606 significant genes) is considered. Thus, our method is more restrictive than edgeR or DESeq2. Thus our method is more restrictive than edgeR or DESeq2. Furthermore, 914 genes are reported in common by edgeR, DESeq2 and OMICfpp using an orness of 0.93 i.e. 54% more that when an orness of 0.37 is used. Moreover, 95.4% of the genes reported using an orness of 0.93 were also reported by the other methods, while in the case of orness of 0.37 only 35% of the genes were reported by the other methods.

The first 30 genes reported by edgeR and DESeq2 are shown in the Tables 4 and 5, respectively. The results obtained using edgeR, show that 66.6% and 60% of the genes were previously reported in CRC and in another type of cancer, respectively. In addition, the bold entries show that 76.67% of the first 30 genes were also reported by the other methods (Table 4). For the DESeq2 results, 80% and 83.3% of the genes were previously reported in CRC and in another type of cancer, respectively, and 70% of the genes are also reported in the other methods (Table 5).

Table 4 The top 30 genes with differential expression reported using edgeR method
Table 5 The top 30 genes with differential expression reported using DESeq2 method

Simulation study

In order to obtain a more complete evaluation of the OMICfpp method, a simulation study was performed, using Poisson counts (Fig. 4b) and negative binomial distributions approach (Fig. 4c).

In the simulation study using Poisson counts, we consider three types of features (genes for instance): significant genes, non significant genes and outliers genes. We have to simulate random pairs of counts for the three types. We consider four Poisson random variables such that the i-th variable Xi follows a Poisson distribution with mean λi for i=1,…,4. If the gene is non significant then we simulate the random vector (X1,X3)=(x1,x3). The pair of counts for the pair are (x1,x1+x3). Note that the mean of X3, λ3, is small i.e. just an small increment of the count. If the gene is significant then we simulate (X1,X2)=(x1,x2) and the counts are (x1,x1+x2). Finally, if the gene is an outlier then we have two types of pairs. The first type of pair is as a pair of a non significant gene. The second type of genes is different. We consider a realization of (X1,X4)=(x1,x4) and the counts are (x1,x1+x4). The mean λ4 is much greater than λ2. The idea is to simulate genes with no differential expression for the most of the pairs except for a few ones. We call them outlier genes. This model is implemented in the function rPairedPoisson of the package OMICfpp. We have used 1000 genes with 50 significant, 50 outliers and 900 non significant genes. For each outlier gene we have used 1,2,…,10 outlier pairs for an outlier gene. The orness values used goes from 0.01 to 0.99 with a step of 0.02. The mean of the random variables will be λ1=10 and λ3=2. The mean λ2∈{6,10,14,18} and λ4 goes from 30 to 60 with an step of 2.

Figure 4b display a simple graphical description. We have three types of genes: outliers, significant and non significant genes. We pretend to compare three methods, between-pair and complete randomization p-values and edgeR. These p-values have been estimated for each simulated data set. The mean of the differences between each pair of methods has been calculated and displayed in this figure by taking into account the value of λ4 i.e. more extreme outliers and for different values of λ2 i.e. more clearly differentiated significant genes. Our p-values are sensible to the outliers (first column) and are similar to the results of edgeR when for λ2 equal to 10, 14 and 18. However, edgeR can detect a difference of λ2=6 and our methods can not detect it. The non-signficant genes are equally non-detected by all methods. Again, edgeR is more powerful but very sensitive to the outliers. Our methods are not so powerful but they detect the outliers and are not so sensible to them.

A similar model has been performed by replacing the Poisson distribution with the negative binomial distribution (Fig. 4c). Now, the means of the four negative distribution used are μ1=10, μ3=2 and μ2 takes the values 20, 30, 40 and 50. The values for μ4 goes from 60 to 100 with a step of 5. The dispersion parameter used for all negative distributions has been 1/10. We think the comments given using Poisson distributions can be applied to the study using negative binomial distributions.

It is important to note that the method DESeq2 can not be applied to this simulated data because it probably needs a greater over dispersion in the data. We have had problems with the estimation of the prior distributions. For this reason, we compare our methodology just with the method edgeR.

Gene expression signatures for colorectal cancer

A total of 491 genes were reported in common for all methods (Fig. 4a), of these 65 genes are within the top 30 previously described (see Tables 2, 3, 4 and 5). These genes are studied in more detail, in order to propose a gene expression signatures for colorectal cancer. A total of 36 genes in common have been previously reported in CRC: ANOS1, CNTN1, SLC11A1, IBSP, MUSK, USP2, GUCA2B, TNFRSF17, COL11A1, IL11, C2orf40, PRPH, TMEFF2, KRT80, UGP2, SLC30A10, GSTM5, ESM1, CA7, WNT2, FEZF1-AS1, AJUBA, CEMIP, CLDN1, SPERT, FOXQ1, INHBA, EPHX4, KLK6, TRIB3, CST1, SPIB, GRIN2D, GCNT2, COL10A1 and MMP7. Furthermore, a total of 29 genes in common have not been previously reported in CRC: LAS1L, ICA1, TNFRSF12A, GTF2IRD1, ZNF200, MYOC, NEXMIF, POLR3E, LGI1, PCSK2, PRELID3A, BEST4, CTNND2, SLC6A19, OTOP2, IGLV2-14, AC016027.1, LINC02577, PCAT2, AC055717.2, DAO, TMEM82, SPATA12, LINC02163, OTOP3, LINC02414, AC07099.1, CPNE7 and LINC02418 (see Tables 2, 3, 4 and 5). The randomization p-value profiles are shown in Fig. 2. In the case of validated genes that are shown in the Fig. 2, all have a defined randomization p-value profiles. The same happens with the profiles of the genes not reported in the literature, with the exception of SPATA12. Probably SPATA12 corresponds to a false positive gene, despite being reported by all methods, since this gene is expressed primarily in testis and may play a role in testicular development and spermatogenesis. On the other hand, little or nothing is known about the AC016027.1, AC055717.2 and AC07099.1 genes, which makes their description difficult. This is the same situation with the coding genes of ncRNA, LINC02577, LINC02163, LINC02414, LINC02418 and the prostate cancer associated transcript 2 (PCAT2) gene. The neurite extension and migration factor (NEXMIF) gene has not been previously reported in cancer, thus its function is unknown. The same goes for the RNA polymerase III subunit E (POLR3E) gene, the PRELI domain containing 3A (PRELID3A) gene, the otopetrin 2 (OTOP2) and 3 (OTOP3) genes, the immunoglobulin lambda variable 2-14 (IGLV2-14) gene and the bestrophin 4 (BEST4) gene that encodes a transmembrane proteins. The LAS1L gene is essential for cell proliferation and also for biogenesis of the 60S ribosomal subunit [32] and has been previously related with pancreatic cancer [33]. The ICA1 gene encodes a protein with an arfaptin homology domain that is found both in the cytosol and as membrane-bound form on the Golgi complex and immature secretory granules. This protein binds to the small GTPase Rab2, thus it can be related to cancer [34]. The TNF receptor superfamily member 12A (TNFRSF12A) gene is well known in cancer, for example, it is linked to poor prognosis in breast cancer [35]. The GTF2IRD1 gene encodes a transcription factor protein, that are related to tumor-promotion [36]. The zinc finger protein 200 (ZNF200) is a little known gene and in cancer, only variants associated to ovarian cancer have been previously reported [37]. The MYOC gene encodes the protein myocilin, which is believed to have a role in cytoskeletal function and it has been previously reported in glaucoma [38]. Thus, this gene has less likely to be validated experimentally in CRC. However, the rest of the genes (LGI1, PCSK2, CTNND2, SLC6A19, DAO, TMEM82 and CPNE7) could be related to CRC and have been previously identified in other types of cancer. Thus, we propose these 20 genes as new candidate genes.

Discussion

We develop OMICfpp as a method for statistical analysis of RNA-Seq data with a paired design and small sample size context. OMICfpp, through the orness election allows to the user assign weight to the results, based on each biological context. However, we also provide the alternative of automatic orness selection. Here we use colorectal cancer data, but OMICfpp can be applied to all kinds of biological problems that involve RNA-Seq analysis.

We use the chooseOrness function to select an orness value of 0.37 and 0.93. We also tested the possibility of using a probability distribution over the orness and use the score for CRC data analysis. The results suggest that the use of the score is a more robust method for gene selection, whereas a single orness selection is a reasonable method. Besides, a large number of genes reported in the top position using the score, are also reported within the results obtained by a single orness value.

On the other hand, we tested OMICfpp results using different sample sizes (“OMICfpp in a small sample size context” section). It is clear that a smaller sample size will affect more the highest values of orness. For low orness there is a great overlapping between significant genes using lower sample sizes (Fig. 3). These results confirm that the sample size is very important in obtaining results. We suggest to use p-values < 0.001 as the cut line for the results obtained using OMICfpp in smaller sample sizes.

The results obtained by OMICfpp method were validated through bibliographic review, and also by a simulation study. An important part of the results are in agreement with the cancer bibliography, validating the OMICfpp method. Also, we compare the results of OMICfpp with those obtained by edgeR [9] and DESeq2 [17]. We obtain a considerably smaller number of candidate genes than edgeR and DESeq2 (Fig. 4), indicating that our method is more accurate. In turn, the results using an orness of 0.93 were also supported by edgeR or DESeq2 by more than 90%. In addition, there is an important coincidence between the top 30 genes reported by OMICfpp, edgeR and DESeq2 methods.

Besides, the simulation study shows that edgeR is more powerful than our procedure. However, the outliers affects more the results of edgeR than ours. If there is a suspect than the differential expression is due to just one or two outlier pairs, then our approach could complement the study.

Moreover, we identify candidate genes not reported by edgeR and DESeq2 methods, which we suggest must be validated. Furthermore, 491 genes are reported by all compared methods (Fig. 4a). Of these, 65 genes are in the top result in all methods and 36 genes have been previously reported in the bibliography as differentially expressed in colorectal cancer (Tables 2, 3, 4 and 4. All of these with a well defined randomization p-value profile. Thus, we deepened in the study of the remaining 29 genes, using the biological data obtained in the bibliography and biological data bases, and our randomization p-value profiles (Fig. 2). Therefore, we propose the use of randomization p-values profiles as an accurate method to select the candidate genes for experimental validation.

Furthermore, in the last 20 years, it has been searched to identify “cancer signature” in terms of diagnosis, prognosis or prediction of therapeutic response [39]. Although the term refers to one or more genes, a biomarker panel with a growing number of genes is currently used. In this sense, we recommend the experimental validation of LAS1L, ICA1, TNFRSF12A, GTF2IRD1, ZNF200, NEXMIF, POLR3E, LGI1, PCSK2, PRELID3A, BEST4, CTNND2, SLC6A19, OTOP2, IGLV2-14, PCAT2, DAO, TMEM82, OTOP3 and CPNE7 genes as new targets for gene expression signature in colorectal cancer.

Conclusions

RNA-Seq is a powerful method to study the complexity of the transcriptome, however there are many challenges to solve. On the one hand, the inclusion of the experimental design in the analysis of the results can contribute to the obtaining of more precise results. In this regard, OMICfpp is an accurate method for differential expression analysis in RNA-Seq data with paired design. On the other hand, a large number of genes identified as differentially expressed in silico are not experimentally validated. In this sense, we propose the use of randomized p-values profile graphic as a powerful and robust method to select the target genes for experimental validation.