Background

microRNAs (miRNAs) are important modulators of gene expression in various biological systems, including development [1], carcinogenesis [2] and virus-host crosstalk [3]. Most mRNAs can be repressed by more than one miRNA and conversely, most miRNAs have many known and predicted mRNA targets. This direct impact of miRNAs on a large number of mRNA species makes it a powerful biological regulator and therefore a prime candidate for studying diseases such as cancer, where miRNAs can act either as oncogenes (oncomiRs) or tumour-suppressors [4]. Gene repression by miRNAs is generally achieved by pairing between the 5’ end of miRNAs, from the second to the seventh nucleotide (called the seed region), and the 3’ untranslated region (UTR) of the gene target [5]. The strong effects of miRNAs on mRNAs regulation could, in theory, mean that specific miRNA-signatures are linked to certain gene expression patterns. Several databases predicting miRNA’s targets based on various algorithms have been launched. Although the lack of specificity in miRNA’s target predictions is a known fact [6], by taking advantage of these existing miRNA-target databases, numerous programs succeeded in highlighting the most relevant miRNAs affecting a gene expression pattern [710]. Indeed, since a miRNA can target several mRNAs, it is possible to compute the statistical significance of the overrepresentation of miRNA’s targets within a gene-set. Most of the aforementioned programs use a standard hypergeometric probability (HP) to predict miRNAs involved in a biological process based on a compendium of miRNA-target databases. This strategy is in line with other tools used to compute the significance of functional annotations from gene-lists. However, when applied to searching miRNA-signatures from gene-lists, there is a considerable overlapping of gene targets between miRNAs leading to a number of significant signatures, including numerous potential false positive predictions [6, 11]. The source of false positives may result from miRNAs sharing a similar seed region. Consequently, it is important that programs and algorithms used are able to highlight the true positives. In contrast to current methods based on HP only, we introduce a novel strategy in complement to HP, which (i) ’weigh-down’ the contribution from overlapping target genes when calculating the significance of each miRNA-signature using an expectation-maximization (EM) algorithm, a general probabilistic framework that can be used for this purpose [12]; and (ii) cluster all predicted miRNAs according to their seed region sequences for identifying “synonymous” predictions. To increase the specificity of our prediction, we also build a large compendium of miRNA’s target predictions based on the most used databases. To our knowledge, the application of EM-algorithm as a probability measurement for significance in functional annotation tools has yet to be explored.

Implementation

miREM workflow

The miREM workflow is composed of the following five steps (Fig. 1):

Fig. 1
figure 1

miREM Workflow. a The gene-list and setup parameters are entered into the input page. b Each transcript is associated to its targeted miRNA(s) using the selected prediction databases. miRNAs are selected only if their respective HP p-values reach a pre-defined enrichment level. Then, these selected miRNAs are subjected to the EM-algorithm to establish the likelihood probability of each miRNA. miRNAs with the highest likelihood probabilities are the most likely to have an influence on the DEG. Afterwards, predicted miRNAs are clustered according to their seed region sequences in order to identify duplicated predictions. c Subsequently, results are displayed in a dynamic graphical interface allowing an easy data interpretation. d Finally, a dendrogram of miRNA seed sequences is generated to help in identifying duplicated predictions (miRNAs sharing similar sequences)

  1. 1

    miREM takes a list of differentially expressed genes (DEG) derived from either RNA-seq or the latest microarray platforms as input. Since the input DEG list is obtained from whole transcriptome level, miREM is inapplicable to the results derived from targeted sequencing technologies or microarray platforms targeting a specific gene panel. After receiving a gene list, miREM associates each transcript to its targeted miRNA(s) using the selected prediction databases, thus to create a list of potentially repressive miRNAs.

  2. 2

    The hypergeometric p-value (and corrected p-value according to Benjamini-Hochberg) is determined for each unique miRNA so as to identify its enrichment significance.

  3. 3

    If only one miRNA is found to have a significant p-value, the program stops and identifies only this significant miRNA as having an influence on the DEG.

  4. 4

    If more than one miRNA can be identified, the program then selects miRNAs with corrected p-values below the specified threshold and subjects them to the EM-algorithm to establish the likelihood probability of each miRNA. miRNAs with the highest likelihood probabilities are most likely to have an influence on the DEG. Along with tab-delimited files, miREM results are available through a clustered heatmap of miRNA-gene interactions to ease visualization of the genes targeted by predicted miRNAs. This visualization also enables users to intuitively infer the kind of co-repression activity these predicted miRNAs play in the system.

  5. 5

    Finally, predicted miRNAs are clustered according to their seed sequences in order to identify duplicated predictions (miRNAs sharing similar sequences).

This unique workflow, coupled with a rich combination of features (Table 1), makes miREM a better alternative when compared to existing softwares.

Table 1 Feature comparisons of five miRNA predicting tools

miREM’s miRNA-target interactions database

Multiple features characterized in the miRNA-target interaction mechanism have been exploited for developing strategies to predict miRNA target genes, including (i) the seed complementarity between miRNA and mRNA strands; (ii) the free energy of the miRNA:mRNA duplex; (iii) the target site accessibility; (iv) the contribution of multiple binding sites and (v) the evolutionary conservation [13]. Many databases exploit one or several prediction strategies to provide miRNA-mRNA interaction information. In order to build a comprehensive resource to predict a miRNA-signature, we compiled those up-to-date information pooled from the latest releases of Diana [14], Miranda [15], mirDB [16], Pictar [17], PITA [18], RNA22 [19] and TargetScan [20] (Additional file 1: Table S1).

Querying miREM’s database

miREM accepts Refseq, Ensembl, UCSC IDs or official gene names as input and uses Ensembl Biomart [21] to unify gene IDs. The miREM analysis interface allows interrogation of the reference databases in a flexible way, enabling users to select one or more preferred reference databases. If more than one database is selected, the user can opt to use only the miRNA-target interactions that are found to be common to all (intersections). Alternatively, users can select miRNA-target interactions that appear in one (union of all databases) or more of the seven databases (intersections from 2 to 7 -all- reference databases). Moreover, for Targetscan and Miranda reference databases, queries can be restricted to evolutionary conserved miRNAs. Reference databases are complementary as a large portion of predicted targets are unique to each database (Additional file 2: Figure S1). The flexibility in querying reference databases provides users a leverage to adjust for prediction specificity and sensitivity.

EM algorithm formulation

Assuming we have N genes and K miRNAs, let us define an N×K matrix Z, where zik=1 if gene i is repressed by miRNA k; otherwise, zik=0. We are interested in estimating the proportion of genes repressed by the kth miRNA, pk,k=1,…,K. By observing matrix Z, in other words, we are certain which of the genes are repressed for each miRNA, then the probabilities pk’s can be estimated by counting the number of genes repressed by the miRNA and dividing this by the total number of genes,

$$\begin{array}{*{20}l} \hat{p_{k}} = \frac{\sum^{N}_{i=1}z_{ik}}{N} \end{array} $$
(1)

However, in reality, we do not directly observe Z. Instead, through miRNA-target prediction databases, we observe matrix Y=(yik),i=1,…,N;k=1,…,K, where yik=1 if the kth miRNA is predicted to target the ith gene, otherwise yik=0. We indicate the parameter of interest θ as (p1,…,pK). The likelihood of the complete data (Y,Z) can be written as,

$$\begin{array}{*{20}l} L(\theta|Y,Z)= \prod_{i=1}^{N}\prod_{k=1}^{K} \left(p_{k}^{z_{ik}}(1-p_{k})^{1-z_{ik}}\right)^{y_{ik}} \end{array} $$
(2)

Note that the above likelihood function implicates a logical assumption that P(zik=0∣yik=0), i.e., gene i cannot be targeted by miRNA k if the prediction database does not predict this.

The log-likelihood of the complete data is taken as the logarithm of the likelihood and is given by,

$$\begin{array}{*{20}l} \ell(\theta|Y,Z)= \sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\left[z_{ik} \log p_{k} + (1-z_{ik})\log(1-p_{k})\right] \end{array} $$
(3)

To estimate θ, we will use EM algorithm that consists of two iterative steps: an E-step where we evaluate the expected value of complete data log-likelihood, Q(θ|Y,Z)=E[ (θ|Y,Z)]; and an M-step where we update the parameter estimates using a current estimate Q(θ|Y,Z). The algorithm starts from an initial estimate \(\theta ^{(0)}=\left (p_{k}^{(0)},k=1,\dots,K\right)\), where \(p_{k}^{(0)}=\frac {1}{K}\). In iteration m, we update θ(m) in two steps:

  • E-step: We update the current estimate of matrix Z as,

    $$\begin{array}{*{20}l} \hat z_{ik}^{(m)} = E\left[z_{ik} | Y_{i},\theta^{(m-1)}\right] = \Pr\left(z_{ik}=1|Y_{i},\theta^{(m-1)}\right) \end{array} $$
    (4)
    $$\begin{array}{*{20}l} = \frac{y_{ik} p_{k}^{(m-1)}}{\sum_{k=1}^{K} y_{ik} p_{k}^{(m-1)} }, \forall i,k \end{array} $$
    (5)
  • M-step: We update the estimate of θ, \(\hat \theta ^{m}\) by updating each of its component,

    $$\begin{array}{*{20}l} \hat p_{k}^{(m)} = \frac{\sum_{i=1}^{N} \hat z_{ik}^{m} }{N}, \forall k \end{array} $$
    (6)

Intuitively, in the E-step, each kth miRNA with yik=1 is assigned a fraction of gene i in proportion to \(p_{k}^{(m-1)}\), and this fraction is \(\hat z_{ik}^{(m)}\) while the M-step updates the probability of each miRNA by averaging the number of genes that are predicted to be repressed by the miRNAs. The algorithm reaches convergence in a few iterations and return the parameter estimates \(\hat p_{k}\), which is used to calculate the EM scores.

Using HP as a filtering step for EM-algorithm inputs

A non-negligible constraint of the EM algorithm is the running-time performance. Indeed, this algorithm is based on multiple iterations with a linear complexity. Therefore, its performance decreases with increments in the input data. Hence, we attempted to limit the number of miRNAs involved in the computation to those that are more likely to be truly significant. To achieve this, we have implemented HP as an initial filtering criterion. Here, the HP is applied to each miRNA, in order to test whether the number of predicted miRNA’s targets are over-represented within the gene-set. Based on the user HP p-value threshold, miREM selects only significant miRNAs and proceeds to the EM step. Where only one miRNA signature is significantly predicted by the HP, no EM is involved.

Classification of miRNA predictions

miRNAs that share similar seed regions are likely to target similar genes. As such, these miRNAs are likely to be co-predicted. In order to identify duplicated predictions, we have implemented a module to cluster miRNAs according to their levels of homology. This classification is done by multiple alignments of miRNA seed sequences using Muscle [22], followed by the generation of a dendrogram computed by PhyML [23] and displayed by jsPhyloSVG [24].

Availability and requirements

miREM webportal is freely available and can be accessed online at https://bioinfo-csi.nus.edu.sg/mirem2/.

Results

We have developed miREM, an HP-EM-based program designed to predict miRNA activities from a gene list. miREM’s web server incorporates a large compendium of human/mouse miRNA-target prediction databases and provides rich output results facilitating prioritization and interpretation of predicted results.

To test miREM performance, we benchmarked miREM predictions against CORNA [7], GeneSet2MiRNA [8], ChemiRs [9], and Sylamer [10] results using several datasets with known miRNA activities. These are detailed in three case studies as follows:

Case study 1: knock-in miRNA experiments

We used two RNAseq expression datasets from miR-155 and miR-1 knock-in experiments in U2OS cells, respectively [25]. In these experiments, we used a gene-set of repressed genes as input (Additional file 3: Table S2) and ran miREM, CORNA, GeneSet2MiRNA and ChemiRs (Table 2 and Additional file 4: Table S3; for Sylamer, whole gene list ranked by fold change was input). miREM has predicted involving miRNAs correctly, with hsa-miR-155-5p and hsa-miR-1-3p ranked at the first and third positions respectively. Similarly, other four tools, namely CORNA, GeneSet2MiRNA, ChemiRs and Sylamer, showed satisfactory predicting performances whereby both miRNAs involved in the experiments were accurately identified (Table 2 and Additional file 4: Table S3).

Table 2 Performance of five miRNA prediction tools using two single miRNA knock-in and one miR-144/451 double knock-out experiments

Case study 2: double knock-out miRNA experiment

In addition, we analyzed a microarray dataset derived from a miRNA double knock-out experiment [26]. Up-regulated genes from CD71 +/Ter119 +/FSC high bone marrow cells in miR-144/451 −/−mice in comparison with wild-type controls (Additional file 3: Table S2) were input into miREM, CORNA and GeneSet2MiRNA (ChemiRs was excluded from this test as it provides only human miRNA-target databases; for Sylamer, the input gene list was whole gene list ranked by fold change). miREM ranked mmu-miR-144-3p and mmu-miR-451a in the first and second positions respectively. However, no prediction results were available for CORNA and GeneSet2MiRNA when p-value threshold was set as 0.01; even if less stringent p-value filtering was applied (p-value threshold equals to 0.05), only mmu-miR-144 was predicted for both tools (Table 2 and Additional file 4: Table S3). For Sylamer, only mmu-miR-144 was in the result (Table 2 and Additional file 4: Table S3).

Case study 3: double knock-out miRNA experiment

Finally, we compared miREM’s prediciton performance in a miR-181a1/181b1 double knock-out experiment [27] with other tools. In miREM, loose criterion in HP filtering step (p-value threshold = 0.01) resulted in a long list of miRNA candidates. Hence, a more stringent screening setting was applied in HP step (p-value threshold = 0.0001). Again, miREM is the only tool which is able to predict both involving miRNAs, with mmu-miR-181b-5p and mmu-miR-181a-5p ranked in first and fourth positions out of total four predictions respectively. Nonetheless, only mmu-miR-181b was predicted by GeneSet2MiRNA while no results were obtained for CORNA and Sylamer (Table 3 and Additional file 4: Table S3).

Table 3 Performance of five miRNA prediction tools using a miR-181a1/b1 double knock-out experiment

Impact of miRNA databases vs algorithms

Integration of EM algorithm with HP test contributes to miREM’s better prediction performance. In order to test the algorithm implemented in miREM while ruling out database bias, we compared the performances of miREM and CORNA (the rest of the test programs were web-server based and could not be modified) on miR-144/451 double knock-out assay using same miRNA-mRNA interaction database (miRanda mouse conserved miRNA database August 2010 release). miREM ranked miR-144 and miR-451 as first and second predictions respectively in its result, whereas 17 miRNAs were predicted by CORNA, among which miR-144 was ranked first while miR-451 ranked 5th (Additional file 5: Table S4). Hence, introduction of EM into miRNA prediction allowed us to better rank candidate miRNAs. Furthermore, the prediction result by miREM was relatively robust. We tested miREM’s performances using different HP p-value thresholds and EM convergence parameters given the down-regulated gene list from hsa-miR-155 knock-in experiment. hsa-miR-155-5p remained the first-ranked candidate in various prediction settings (Additional file 6: Table S5).

Conclusion

The combination of HP and EM algorithm coupled with a large miRNA-target compendium of databases makes miREM a tool of choice to predict and prioritize miRNAs from a given gene list. Programs like miREM rely on miRNA databases, which can be a source of bias, particularly for uncommon miRNAs processed by the Ago2 endonuclease [28], an alternative mechanism independent to Dicer. Therefore, there is still room for improving target predictions by a better characerization of miRNA targets. Overall, we have demonstrated that miREM’s prediction performance is either similar or better than existing programs such as CORNA, GeneSet2MiRNA, ChemiRs and Sylamer.

Finally, the versatility of the miREM web server makes it accessible to a large panel of users including non-bioinformaticians, by facilitating result exploration and interpretation through numerous representations and a dynamic graphical interface.