Background

MicroRNAs (miRNAs) are endogenous ~22 nucleotide RNA molecules that act as fundamental repressors of gene expression in many biological systems. In animals, they target mRNAs by recognizing and directly binding to multiple partially complementary sites preferentially located in the 3' untranslated regions (UTRs) of transcripts. Watson-Crick base-pairing to the 5' end of miRNAs, especially to the so-called 'seed' region that comprises nucleotides 2-7, is considered crucial for targeting [1, 2], even if recently developed techniques for ligation and sequencing of miRNA-target RNA duplexes highlight widespread noncanonical seed interactions, containing bulged or mismatched nucleotides [3]. Although molecular mechanisms of miRNA action remain intensely debated [4], multiple studies revealed that mammalian miRNAs repress genes predominantly by destabilization of target mRNAs [5, 6]. By computational and experimental approaches it was established that thousands of human protein-coding genes are regulated by miRNAs [7, 8]. Given the wide scope of their targeting, miRNAs are considered as an additional layer of regulatory circuitry in the cell. Experimental observations suggest that miRNAs are regulators of development and cellular homeostasis through their control of diverse biological processes, from differentiation and proliferation to apoptosis [9]. Their role in regulating fundamental cell mechanisms suggests that they could be involved in cancer and indeed their expression is strongly deregulated in almost all human malignancies. Functional characterization of these aberrantly expressed microRNAs indicates that they might function as oncogenes and tumor suppressors [10, 11].

Identifying key miRNAs contributing to the genesis and development of a particular disease is a focus of many recent studies. Statistical methodology for this task is not fully established due to the mild effect of miRNAs on the expression of their targets. A major source of information to infer the actual regulatory activity of miRNAs derives from high-throughput experimental data such as transcriptome profiles. The underlying assumption is that regulatory activity by miRNAs could be reflected by the expression changes of their target transcripts. For miRNAs that promote mRNA decay, there would be a negative correlation between miRNA and mRNA expression. Existing tools based on this assumption mainly rely on case-control mRNA profile experiments involving strong perturbations such as the knockout/knockdown/overexpression of one or few miRNAs [1214]. In data originating from less controlled conditions, such as mRNA profiles of pathological tissue collected from patients, detecting miRNA-mediated target destabilization is more challenging due to presence of multiple cell types in samples, the activity of additional regulatory factors and complex RNA cross-regulation such as the miRNA sponge effect [1517]. The recent study of a large cohort of breast tumour samples [18] suggests that miRNAs exert their effect by modulating mRNAs rather than acting as on-off switches. Other studies inferring miRNA regulation on tumour sample transcriptome exploit additional molecular information such as AGO2-PAR-CLIP binding-site data [19] or DNA copy number and promoter methylation at the mRNA gene locus [20].

We introduce here a rank-based method to detect miRNA regulatory activity combining three sources of information, namely measurements of gene and miRNA expression levels from the same biological samples and sequence-based target predictions. Rank-based approaches such as Gene Set Enrichment Analysis (GSEA) [21] are designed to detect modest but coordinated changes in the expression of sets of functionally related genes. This is particularly suitable to infer miRNA regulatory effect from tissue expression profiles, in which this effect is subtle at the level of individual genes but affects a large number of genes. The original GSEA algorithm ranks all genes based on the correlation of their expression with a phenotype of interest and looks for predefined groups of functionally related genes that are enriched at either the top or bottom of the ranked list. We propose here a new scoring scheme in which the enrichment profile is based on both the correlation between gene and miRNA expression levels and the confidence of sequence-based target prediction. The defined enrichment score for a given miRNA is expected to be high if most of its predicted targets are at the top or at the bottom of the ranked list. The significance of the enrichment score is evaluated by a permutation procedure. As final result we obtain miRNAs showing a statistically significant enrichment score, which we consider as potential regulators in the analyzed conditions. The analysis pipeline is summarized in Figure 1. It has been implemented as a freely available R script (code available at http://bioinfo-out.curie.fr/projects/targetrunningsum).

Figure 1
figure 1

Schematic summary of the pipeline.

We applied our method to elucidate the regulatory effect of miRNAs on the breast cancer transcriptome. Breast cancer is classified into various subtypes mainly based on the immunohistochemical staining of estrogen (ER), progesterone (PR) and HER-2 (ERBB2) receptors. The complex nature and heterogeneity of this disease, particularly with regard to gene expression profiles, make it difficult to detect the shaping effect of miRNAs on the transcriptome. We applied our algorithm to a breast carcinoma dataset including gene and miRNA expression from normal and breast tumour samples (which we refer to as Maire dataset [22, 23]) and we show that it is able to identify miRNAs with statistically significant enrichment score. These results are then compared with those obtained on an independent dataset of normal and breast tumour samples from The Cancer Genome Atlas (TCGA) [24]. We further focus on the triple negative breast cancer (TNBC) to highlight miRNAs potentially relevant in this particular tumour subgroup. This subtype is intensively studied due to the lack of effective targeted therapies. We ran our algorithm including only samples characterized as triple-negative breast tumours and identified a set of miRNAs showing statistically significant signal in both Maire and TCGA datasets.

Finally, we investigated miRNAs identified as potential regulators to characterize the function of their targets. In the proposed algorithm, hundreds of genes account for the enrichment signal of a single miRNA and we expect a subset of them to participate in common cellular functions. We use multiple annotation databases to infer biological processes affected by the identified miRNAs. For those identified in both datasets, the corresponding sets of target genes were analyzed separately. We observe that even if the specific genes accounting for the enrichment of biological categories among miRNA target genes detected in the Maire dataset and in the TCGA datasets are not necessarily the same, they are associated to common cancer-related pathways

Results

Functional miRNAs in breast cancer

The analyzed dataset includes 129 tumour and healthy breast tissues for which both miRNA and mRNA expression profiles are available (see Material and Methods). In this study, we used sequence-based target sets obtained from TargetScan version 6.2 [8, 25], a widely used algorithm which takes into account sequences that match the seed region of each miRNA and evaluate their conservation in several species. The confidence of target predictions is calculated as described in Methods. We include in our study 394 miRNAs for which both expression data and TargetScan predictions are available.

We identified 136 miRNAs as potential regulators with FDR < 0.1 (results are reported in Additional File 1, Table S1). Among the top significant miRNAs, we found several ones that are known to function as oncomirs, such as the members of miR 17-92 cluster and its paralog cluster miR-106b-25, miR-15 and miR-16 [26].

To validate these results, we applied our algorithm to an independent dataset of 521 healthy and cancerous breast tissue samples from the TCGA project (see Material and Methods). This study includes 260 miRNA for which both expression data and TargetScan predictions are available. Of these, 142 miRNAs are identified as regulators with FDR < 0.1. The intersection between results obtained in both datasets contains 44 miRNAs (see Figure 2a and Table 1) which corresponds to a statistically significant overlap (hypergeometric p − value < 0.05).

Figure 2
figure 2

Venn diagrams showing the overlap between results obtained in Maire cohort and TCGA when using all samples (a) and when restricting the analysis to the TNBC subtype samples (b).

Table 1 Common predicted miRNAs in Maire cohort and TCGA for the analysis including all samples.

For a limited subset of 7 commonly identified miRNAs, the enrichment score obtained for the Maire dataset has opposite sign compared to that obtained for the TCGA dataset. In these cases, predicted targets are enriched among genes whose expression is positively correlated with the expression of the miRNA. The observation of miRNAs positively correlated with their predicted targets is in agreement with analogous integrated analysis of miRNA-mRNA correlation in tissue samples [27]. An interesting hypothesis suggests that this effect can be explained by common transcriptional regulation conferring robustness to gene expression program and ensuring tissue identity. Consistently, architectural features of the mammalian miRNA regulatory network reveal that the coordinated transcriptional regulation of a miRNA and its targets is an abundant motif in gene networks [2830]. In our analysis, we consider both positive and negative correlation of predicted targets as a good evidence to infer miRNA regulation. We investigated whether the correlation sign of the miRNA expression with that of its targets is associated to a different proportion of correlated targets or a different correlation strength. For each significant miRNA we extracted the leading-edge subset of genes, corresponding to those genes in the set S m that appear in the ranked list before the point where the running sum achieves the ES (see Figure 3). This can be interpreted as the core of a gene set that accounts for the enrichment signal. We plotted the size of the leading-edge subset and the maximum correlation value as function of the correlation sign. Despite the size of the leading-edge subset and its maximum correlation value do not differ significantly according to the sign of the correlation, results show a trend toward stronger correlations of negatively correlated leading targets (Additional File 2, Figure S1).

Figure 3
figure 3

An illustration of statistically significant enrichment score and selection of the leading-edge targets tested for functional enrichment. For a given miRNA, the red line in the plot corresponds to the running sum profile obtained for the ranked list of predicted targets while the gray lines correspond to the profiles obtained when permuting the original list. Orange dashed lines indicate levels of significance with p <0.1, p <0.05 and p <0.01.

Triple negative breast cancer specific study

Of all breast tumours, TNBC is a very malignant subtype with poorly characterized molecular pathogenesis [31]. To elucidate the role of miRNA regulation in this specific cancer subtype, we applied our algorithm to the Maire dataset including only TNBC samples (n = 37). Similarly, we performed the analysis on the TCGA dataset selecting only TNBC samples (n = 82). Tumors were assigned to this subgroup according to ER, PR and HER2 negative status. Using TargetScan predictions, the algorithm identified 166 significant miRNAs in the Maire dataset (89 of them included in the common universe of miRNAs analyzed in both datasets) and 75 (64) in the TCGA dataset with FDR < 0.1. The agreement between results obtained in the two datasets is highly significant (41 commonly identified miRNAs, hypergeometric p − value < 104). Results are reported in Figure 2b and Table 2.

Table 2 Common predicted miRNAs in Maire cohort and TCGA for the TNBC specific analysis.

A group of 20 miRNAs is identified specifically in the TNBC study. Remarkably, some of these are already associated with aggressive breast cancer: miR-29a, miR-29c and miR-148a have been shown to be downregulated and associated to aberrant hypermethylation in basal-like cell line [32], miR-27a is involved in endothelial differentiation of breast cancer in a basal-like cell line [33] and in the MDA-MB-231 basal-like cell line [34] and miR-139-5p is described as a regulator of breast cancer cell motility and invasion [35].

Identification of biological processes targeted by miRNAs

To assess the biological relevance of miRNAs identified as potential regulators, we investigated whether the genes that account for the enrichment signal of a given miRNA participate in the same cellular process or signalling pathway. The assumption that some miRNAs downregulate a group of genes participating in the same pathway is supported by multiple experimental studies [3638]. Based on this hypothesis, for each significant miRNA we tested the leading-edge subset of genes for functional enrichment using curated annotation databases, such as Gene Ontology [39], KEGG [40], BioCarta [41], Reactome [42] and ACSN [43].

For miRNAs identified in both Maire and TCGA datasets, the corresponding subsets of leading-edge targets were analyzed separately. Interestingly, these subsets of genes display highly significant overlap and similar enriched categories, supporting the relevance of miRNA regulatory role in breast cancer. We report in Additional File 3, Table S2 the complete list of enriched categories with Bonferroni corrected p-values below 102. The most enriched categories include breast cancer related processes like cell substrate adherens junction, regulation of cell migration, nuclear pore complex and integrin pathway.

Conclusions

High-throughput mRNA and miRNA expression data from large cohorts of normal and pathological tissue samples can be exploited to detect miRNA regulatory activity. The rank-based algorithm introduced here is able to detect miRNA-mediated target destabilization from normal and breast cancer expression profiles. Reproducible results were obtained in two independent datasets, providing a list of miRNAs potentially relevant in breast cancer. Moreover, the association of these miRNAs to cancer related processes is supported by functional enrichment of affected target genes.

The fact that a better overlap was obtained between Maire cohort and TCGA when restricting the analysis to the TNBC subtype compared to when using all samples may at first seem counterintuitive since using more samples should allow better power to detect correlations. However, this observation may be explained by the fact that when considering all breast cancer subtypes and healthy samples together, a larger part of the variation in the transcriptome data will arise from factors that are not directly linked to miRNA activity. For example, when putting together data from the luminal and non luminal subtypes, much of the variation will be associated with the status of the estrogen receptor pathway. Such variation can induce an important correlation structure in the data that may confuse the detection of the much subtler variation associated with miRNA regulation. The proposed algorithm can be considered a suitable tool to elucidate the regulatory role of miRNAs in physiological conditions.

Methods

Computational framework

The proposed algorithm requires as input genome-wide miRNA and gene expression data from the same biological samples and sequence-based predicted miRNA target sets. A three step procedure is applied to each miRNA m:

  1. 1.

    All genes are ranked according to the correlation between gene expression and the expression of miRNA m

  2. 2.

    The enrichment score defined in Equation (1) is computed for the sequence-based target set S m associated to the miRNA m

  3. 3.

    The significance of the enrichment score ES(S m ) is evaluated by a permutation procedure estimating an empirical p-value PV(ES(S m ))

The ranking scheme and the enrichment score definition are described as follows.

Let G = {g1, g2, ...g N } denote the list of all genes included in a genome-wide transcriptome experiment. For a given miRNA m, we sort this list according to the non-parametric (Spearman) correlation between gene expression and the expression of the miRNA m and get a ranked gene list G m = {gm1, gm2, ...g mN }. Given the sequence-based target set S m G associated with the miRNA m and the corresponding prediction confidence weights W m , we define the enrichment score as the running sum's maximal deviation from zero over all genes:

E S S m = max 1 i N j = 1 i s m j r m j α - < s m r m α w i t h W m = s m j for g m j S m 0 e l s e w h e r e
(1)

where r mj is the correlation between the expression of gene j and the expression of miRNA m, < s m · |r m |α > is the average of prediction confidence-weighted correlations for the set S m . The parameter α controls the contribution of the correlation r mj such that correlation values can contribute linearly (α = 1) or non-linearly in the running sum. Similar to the original GSEA algorithm, in this study, we set α = 1. According to this equation, the running sum is incremented by value (s mj · |r mj |α− < s m · |r m |α >) when encountering a gene in S m and decreased by < s m · |r m |α > when not.

The statistical significance of the enrichment score ES(S m ) is assessed by a permutation based p-value. The enrichment score of the randomly shuffled list G m is computed for N p =1000 permutation rounds and the empirical null distribution of ES(S m ) is generated. An empirical p-value PV(ES(S m )) is estimated for the positive and the negative region of the distribution by the proportion of permutations that result in larger ES(S m ) than originally observed (or lower ES(S m ) for the negative region). Once PV(ES(S m )) for a miRNA m is below a fixed threshold, the regulatory activity of miRNA m on the transcriptome is inferred. The PV threshold is set according to False Dicovery Rate (FDR) obtained by the Benjamini-Hochberg method [44]. In our study, the PV threshold was set according to FDR < 0.1.

Tissue collection

Healthy samples from mammary plastic surgery and tumor samples were obtained from patients treated at the hospital of Institut Curie (Biological Resource Center, Paris, France) as described previously [22, 23]. Experiments were performed in agreement with the Bioethic Law No. 2004-800 and the Ethic Charter from the French National Institute of Cancer (INCa), and after approval of the ethics committee of our Institution. From the initial dataset, we retained only the samples for which both mRNA and miRNA data were available, numbering to 129, including 11 healthy breast tissue samples, 37 TNBC, 28 ER-/HER2+, 24 Luminal A and 29 Luminal B breast tumour samples as characterized by immunohistochemical staining.

MiRNA expression data

Samples were hybridized on the Agilent miRNA microarray kit (V3). One hundred ng of total RNAs was hybridized to the microarrays according to the manufacturer's instructions. Hybridized microarrays were scanned with a DNA microarray scanner (Agilent G2565BA) and features were extracted using the Agilent Feature Extraction image analysis tool with default protocols and settings. Data were first transformed using the reverse hyperbolic sine function and quantile normalized. The data were then corrected for a hybridization batch effect using a linear model including the hybridization series as a fixed effect. Next, probes with negative intensity values in 95% or more of the arrays were discarded, leaving 516 miRNAs for analysis. When technical replicates for a sample were present, they were subsequently averaged. Two samples displayed an outlier behavior evident from Principal Component Analysis (data not shown) and were discarded from this analysis. The dataset can be downloaded here at the following address: http://bioinfo-out.curie.fr/projects/targetrunningsum.

Gene expression data

For the protein-coding transcriptome, the data from Affymetrix U133plus2 arrays was processed as described [22]. In summary, we used the brainarray HGU133Plus2-Hs-Entrez version 13 custom chipset definition file [45], data were then normalized with GC-RMA [46], technical batch artefacts were corrected using a linear model, and genes with noise-level expression in 95% or more arrays were filtered out leaving 11543 probesets each corresponding to a unique Entrez Identifier.

MiRNA and gene expression data from TCGA

We conducted our study on the publicly available data of tumour and healthy breast tissues from TCGA described in [24]. We selected 500 tumors and 21 tumor-adjacent normal breast tissue samples for which both mRNA and miRNA data were available. Of the 500 tumors, 82 were assigned to the TNBC subtype according to ER, PR and HER2 negative status. In this dataset mRNA expression levels were determined by Agilent custom 244K whole genome microarrays and miRNA abundance was measured by Illumina sequencing technology. Level 3 released data contain quality controlled and processed data done by Broad Institute's TCGA workgroup with expression call for genes per samples. Gene level expression data were normalized by using Robust Multi-array Average (RMA) and expression values were gene centered. MiRNA expression was given as read counts normalized to relative read frequency in each sample. Detected miRNAs were defined as having more than 10 reads in at least 10% of the samples, leaving 332 miRNAs for analysis.

Target predictions by TargetScan

In the TargetScan algorithm, the prediction score of a miRNA binding site depends on the level of conservation and sequence context criteria such as the distance of the target from the 3'UTR end and the AU composition of the flanking area. For each miRNA, we take as prediction confidence weight of its targets the total context scores generated by the algorithm for 3'UTRs aggregating the binding site scores. By construction, the total context scores range between -1 and 0. When total context score for multiple 3'UTR isoforms are predicted, we take the total context score of the longest 3'UTR isoform.