Background

The advent of next-gen sequencing technologies and their application to transcriptomes have shown that the vast majority of the human genome is transcribed [1, 2] and the non-coding RNAs (ncRNAs) represent the largest class of transcripts in the human genome [3, 4]. NcRNAs are categorized into different groups based on length, location, or function: long non-coding RNAs (lncRNAs), microRNAs (miRNAs), small interfering RNAs (siRNAs), small nucleolar RNAs (snoRNAs), small nuclear RNAs (snRNAs), and Piwi-interacting RNAs (piRNAs).

NcRNAs have been implicated in a wide array of cellular processes [2, 5,6,7] and emerging evidence further reinforces that they have crucial functional importance for normal development and disease [8]. For example, lncRNAs, the largest class of ncRNAs, are reported to control nuclear architecture and transcription, modulate mRNA stability, translation, and post-translational modifications [7, 9]. Nevertheless, only a small fraction of ncRNAs have been functionally characterized today, and most ncRNAs’ functions remain unknown. The lack of functional annotation of ncRNAs presents a challenge when an ncRNA set of interest is available and needs to be functionally investigated for further analysis.

Most of the available ncRNAs functional enrichment tools are limited to miRNAs. In the first step of these tools, they make a list of genes that are targeted by at least one of the miRNAs in the input set, which is followed by an enrichment analysis on this target gene set [10,11,12]. The target set is derived from experimentally validated interaction databases or produced by target prediction algorithms. Among them, Corna [10], miRTar [12], and Diana-miRPath v.3 [11] differ from varied features such as the source of the targets or the functional sets on which the analysis is conducted. Since the predicted target interactions might include high false positives and are not context-specific, some methods also take into account the changes in mRNA levels. MiRComb [13] conducts a miRNA-mRNA expression analysis followed by miRNA target prediction on the negatively correlated mRNA targets. miRFA [14] considers both the negatively and positively correlated using TCGA data. miTALOS [15, 16] additionally provides a tissue-specific filtering of the targets.

There is also a limited number of tools that offer functional annotation and enrichment analysis on lncRNA sets. Similar to miRNA methods, these methods first find a set of coding genes that are co-expressed genes with the given lncRNA or the lncRNAs in the collection and conduct analysis on these coding genes [17,18,19]. With regards to other ncRNAs, only a few studies provide analysis for ncRNAs other than lncRNA and miRNA. StarBase v2 first constructs a regulatory network based on experimentally identified RNA binding sites and their interactions; next, they perform functional enrichment on the interacting coding genes of the ncRNAs [20]. Starbase v2 offers analysis on miRNAs, lncRNAs, and the pseudogenes. CircFunBase [21] is not an enrichment tool but provides manually curated functions of circular RNAs that can be used for enrichment analysis.

The available tools are limited to the type of input ncRNA they support and do not take into account genomic neighborhood information. In this work, we present NoRCE (Non-coding RNA Sets Cis Enrichment Tool), which offers broad applicability and functionality for enrichment analysis of all types of ncRNAs sets using genomic proximity. NoRCE first finds nearby coding genes on the genome of the ncRNAs in the input set and uses the functional annotations of this coding gene set to perform functional enrichment on the ncRNA set. The motivation of using coding genes for annotation is based on the evidence presented earlier that genes nearby can be linked functionally. Thevenin et al. [22] show that functionally related coding genes are co-localized on the genome. Engreitz et al. [23] report that both coding and non-coding genes can regulate the expression of neighboring genes on the genome. There are several instances of lncRNAs that influence the nearby genes’ expressions [24,25,26]. For example, Ørom et al. [27] report that the depletion of some ncRNAs led to decreased expression of their neighboring protein-coding genes. Others also support the involvement of lncRNAs in the cis regulation, where both the regulatory ncRNA and the target gene are transcribed from the same or nearby genomic locus [28]. Based on these findings, in this work, we take into account the coding genes nearby to functionally assess a given ncRNA set. The transfer of functional annotation from nearby coding genes has been used in the general genomic interval set enrichment tools [29,30,31,32].

To offer broad functionality and applicability, NoRCE allows several additional features. The identified neighborhood coding gene set can be filtered or expanded with coding genes found to be co-expressed with the input ncRNAs. For this, NoRCE allows users to input their expression data or make use of pre-computed correlation results for The Cancer Genome Atlas (TCGA) project expression data. Since TAD boundaries affect the expression of neighboring genes [33], NoRCE also allows analysis that takes into account the topologically associated domain regions (TAD) boundaries on the genome. NoRCE provides miRNA specific options as well; the user can filter the neighbor set with predicted targets of the input miRNAs. Moreover, the input ncRNA set can be filtered based on ncRNA biotype (such as sense, antisense, lincRNA). NoRCE supports various commonly used statistical tests for enrichment.

In the following sections, we first detail the NoRCE’s capabilities and the technical details. We also exemplify the NoRCE on two different functional analyses. In the first use case, we analyze the set of ncRNAs differentially expressed in brain disorder, while the second one showcases miRNA specific analysis on cancer patient data.

Implementation

Capabilities of NoRCE and workflow are summarized in Fig. 1. For a given set of ncRNAs, NoRCE first recognizes the coding genes close to ncRNA genes on the linear genome. Based on user-specified options, these genes are expanded or filtered using co-expressed genes, target predictions, or using the information on the TAD regions. Once the genes of interest are gathered, several gene enrichment analyses are performed. The details of these steps are provided in the following sections.

Fig. 1
figure 1

The workflow of the NoRCE package

Species supported

NoRCE supports analysis for Homo sapiens, Mus musculus (house mouse), Rattus norvegicus (brown rat), Danio rerio (zebrafish), Drosophila melanogaster (fruit fly), Caenorhabditis elegans (worm) and Saccharomyces cerevisiae (yeast). For Homo sapiens, it handles human hg19 and hg38 assemblies. For the other species, it uses the most recent assembly of the species. Supported assemblies for different species are provided in Additional file 1: Table S1.

Curating the cis coding gene list

NoRCE accepts a set of any type of ncRNAs, \(S = \{r_1, \ldots , r_n\}\). For each ncRNA, \(r_i \in S\), in the input list, NoRCE identifies all proximal protein-coding genes in 1D genome. The proximal genes are considered as those that are within the base-pair limit of the genomic start coordinate of the input gene and/or within the base-pair limit of the genomic end coordinate of the input gene. If the coding gene \(r_i\) is located within the user-specified base-pair limit from the upstream and/or downstream of known transcription start and/or end position of the ncRNA gene, it is designated as a neighboring coding gene of \(r_i\) and added to the coding gene list pool of \(C_i\). The union of the coding genes, constitute the final coding gene set to be tested for functional enrichment, \(C = \cup _{i=1}^n C_i\). The pool of coding genes can be further filtered or expanded based on the additional biological evidence available, detailed in the next sections, with user-selected options. Users can also limit the analysis to the introns or exons of the neighboring coding genes. In that case, NoRCE applies the genomic proximity criterion on the intron or the exon of the genes based on the user’s selection.

Input can be provided to NoRCE in the form of gene symbols, Ensembl genes and transcripts, Entrez IDs, or miRBase IDs. Since no single source contains information on all the transcripts, gene coordinates and their annotations are retrieved from two different databases: ENSEMBL [34] and UCSC [35]. We collect the ENSEMBL data via biomaRt package [36]. Genes are retrieved from UCSC using the rtracklayer package [37].

Incorporating co-expression information

Since coding genes that exhibit high co-expression patterns can hint to functional cooperation, NoRCE enables the user to incorporate co-expressed coding genes into the analysis. If the filtering option is set, each \(C_i\) is filtered such that only the neighboring coding genes that are also co-expressed with \(r_i\) are placed into C. If the expansion option is set, a coding gene is co-expressed with any of the \(r_i \in S\) is added to C.

NoRCE enables the user to conduct the expression analysis with user input expression data. In this case, users are expected to load the expression data in TSV or TXT format; or they can use the SummarizedExperiment object in R. Before the correlation analysis, NoRCE executes a pre-processing step on expression data. The variance of each gene’s expression is calculated, and genes that vary lesser than the user-defined variance cutoff, 0.0025 by default, are excluded from the analysis. NoRCE supports commonly used correlation measures: Pearson Correlation, Kendall Rank Correlation, Spearman's Rank Correlation. The default values for correlation coefficient cutoff is 0.3, for significance p-value, 0.05 and confidence level 0.95. The user can set the correlation and significance cutoffs based on their need.

To assist analysis for cancer, NoRCE also allows using pre-computed co-expressed gene sets for ncRNAs measured in The Cancer Genome Atlas (TCGA) project [38]. Since TCGA contains the expression profiles for miRNA, mRNA, and lncRNA, this examination is limited to only the miRNA and lncRNA inputs. The co-expressed genes are defined using the Pearson correlation coefficient. Users can set the cutoff for the correlation coefficient.

Filtering genes with the TAD boundary information

The gene regulatory interactions are affected by the 3D chromatin structure of the genome [39]. On a single chromosome, chromatin compartmentalizes into sub-domains, named as topologically associating domain (TADs). TAD boundary regions insulate the cis-regulating elements [40]. NoRCE allows filtering based on TADs. If this option is selected, when curating the nearby genes of an ncRNA, NoRCE will only include the coding genes within the same TAD boundary with that of the ncRNA. We compile TAD regions for different cell-lines and species from various sources and made them available for use in conducting the analysis. These data sources and the species for which they are available are provided in Additional file 1: Table S2. NoRCE allows inputting BED formatted TAD boundary files. Thus, the user can conduct this analysis with other available TAD information.

Biotype specific analysis

If the user wants to conduct a biotype specific analysis, NoRCE can select the ncRNAs that are annotated with the given biotypes and use this biotype-filtered subset in the subsequent steps. Also, NoRCE allows extraction of ncRNAs of given biotypes S and performs analysis on the subset of genes that do not contain the genes annotated with given biotypes. NoRCE accepts GTF formatted GENCODE annotation files for biotype analysis.

miRNA target list

For miRNA specific inputs, NoRCE provides additional features. The coding gene set, C, can be restricted to the potential miRNA targets; thus, only neighboring coding genes that are also miRNA potential targets are included. The miRNA target list is curated from various sources. Computationally predicted miRNA-target interactions are obtained from the TargetScan [41] for the species except Rattus norvegicus as it is not available. Target predictions for Rattus norvegicus miRNAs are obtained from the miRmap [42]. No miRNA is reported for Saccharomyces cerevisiae [43]. Thus, NoRCE does not provide any miRNA analysis for Saccharomyces cerevisiae. Table 1 presents the details of the pre-computed target predictions.

Table 1 List of miRNA target prediction algorithms used for each species

Enrichment analysis

Once the coding gene list, C, is curated, NoRCE conducts functional enrichment analysis. NoRCE supports analysis with various functional annotations: gene ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome pathway, WikiPathways, genes, or GMT formatted integrated pathway dataset. For the annotation, we make use of biannually updated databases in Bioconductor. Gene ontologies and their annotated gene list are provided via GO.db [44] package. To increase the statistical analysis power, only the GO terms with at least 5 annotated protein-coding genes are considered as suggested by [17]. KEGG annotation is performed using KEGG.db [45], for Reactome enrichment analysis, NoRCE utilizes reactome.db [46]. NoRCE employs WikiPathways API to retrieve the pathways and annotated gene list [47].

NoRCE supports commonly used enrichment tests: hypergeometric test, Fisher Exact Test, Binomial Test, \(X^2\) test. We refer readers to [48] for details of the statistical tests. The background gene set is all the genes in the functional annotation source that is selected. NoRCE provides the flexibility of providing a user-defined background gene set.

Presentations of the results

NoRCE provides different ways to export the results. All information in enrichment analysis can be retrieved in a tabular format. Also, users can set the number of top enrichment results to exported, and NoRCE outputs these results based on p-value or p adjusted values in a tabular format. Networks and dot plots can be used to visualize the enrichment results. The dot plot shows the top enriched terms, their p-values (or p adjusted values), and the number of enriched genes in the input neighbor set. In the network representation, the enriched terms are represented with nodes, and the ncRNA and coding transcripts related to the enriched terms are represented with edges. In this graph, the node size is proportional to the node degree. The nodes in the networks are clustered, and a color code distinguishes between node clusters. Modularity clustering is employed to cluster the nodes [49]. For network visualization features, NoRCE makes use of the igraph package [50].

NoRCE also offers specialized visualization options for pathway and GO analysis. GO enrichment results can be illustrated in a directed acyclic graph (GO-DAG). We derive the DAG information through the AmiGO API [51]. In this diagram, nodes are GO-terms, and edges indicate relation types between GO terms. Enriched GO-terms are colored according to their p-values or p-adjusted values. Users can export enriched GO-DAG diagrams in a PNG or SVG format. For pathway enrichment results, KEGG and Reactome enrichment results can be visualized within KEGG and Reactome maps. The enriched terms are marked with color using the KEGG and Reactome APIs, respectively. These visuals are displayed through the browser. In the results sections and the supplementary materials, we provide examples of these visualizations.

Results

To demonstrate how NoRCE could be used to analyze a list of ncRNAs functionally, we apply NoRCE on several problems and multiple independent datasets. We use the default parameter settings in the following analyses unless otherwise stated.

Case study 1: enrichment analysis of the ncRNAs for the psychiatric disorders

In this use case, we demonstrate the functional enrichment analysis of a set of ncRNAs related to brain disorders based on gene expression data measured by Gandal et al. [52]. These ncRNAs exhibit gene- or isoform-level differential expression in at least one of the following disorders: autism spectrum disorder (ASD), schizophrenia (SCZ), and bipolar disorder (BD). In total, the ncRNA gene set contains 1,363 differentially expressed human ncRNAs. We perform GO enrichment for biological processes and pathway enrichment analysis based on pathways provided by Bader Lab [53]. The number of pathways and the different pathway sources included in the Bader Lab set is provided in Additional file 1: Table S3. In these enrichment analyses, the background gene sets are described as the groups of all annotated genes in the corresponding GO or pathway dataset. The protein-coding genes that fall into this neighborhood region of the ncRNAs are input to the enrichment analysis. We also showcase NoRCE’s ability to constrain the input set with protein-coding genes within the TAD boundaries.

Functional enrichment results

The dot plot in Fig. 2 shows the top 35 enriched BP GO-terms, sorted based on the significance of enrichment. The number of annotated genes with the corresponding GO-terms are provided in the graph. We detect RNA related GO terms such as the positive regulation of pri-miRNA transcription by RNA polymerase II, miRNA mediated inhibition of translation, and RNA processing. Additionally, various GO terms are pertinent to various neurological functions such as response to ischemia, sensory perception of pain, and neurogenesis. It has been reported that cerebral ischemia-induced genes are upregulated in schizophrenia [54], and it is common to have chronic pain in bipolar patients [55]. Interestingly, we observe that the enriched terms include cardiac and vascular-related functions. Several studies exhibit interactions between neural diseases and changes in blood vessel pathology and blood flow [56,57,58]. Others reveal that patients with bipolar disorder have low heart rate variability, which is a physiological measure of variation in the time interval between each heartbeat [59, 60].

Fig. 2
figure 2

a Top 35 GO biological processes enriched in the set of differentially expressed brain ncRNAs from [52]. The x-axis represents the p-values; the y-axis represents the GO terms. The dot area is proportional to the size of the overlapping gene set, and the color signifies the p-value of the enrichment test for the corresponding GO-term. b The functional enrichment analysis is repeated with TAD filtering and the GO-term:ncRNA network is provided. The size of the nodes is proportional to the degree of the node. Different colors are used to differentiate the clusters of nodes

Alternative visualizations of these functional enrichment results are provided in the Additional file 1 section. Additional file 1: Figure S1 shows the top 7 GO terms in a network visualization format. Additional file 1: Table S4 lists the top enriched GO BP terms to showcase the tabular format output capabilities.

Functional enrichment results with TAD filtering

We repeat the previous functional analysis when the TAD filtering is on. When this filter is applied, only the protein-coding genes near the ncRNAs in the input list, and at the same time reside within the same TAD regions are included in the enrichment analysis. In this analysis, we use custom defined TAD regions for the adult dorsolateral pre-frontal cortex that are provided by the [52] study, and we keep all the other parameters in their default values.

Figure 2 illustrates the GO-term network for the top 7 enriched GO terms. Alternative representations of these results are provided in the Supplementary Materials (Additional file 1: Table S5, Figs. S2 (A), and S2 (B) ). Interestingly, in this analysis, we identify cell cycle regulation related GO-terms. Cell cycle regulating genes have been associated with autism in GWAS studies [61]. In DNA derived from the pre-frontal cortex, cell cycle regulating genes show autism-specific CNVs [61]. In schizophrenia and bipolar disorder, many genes participate in cell cycle regulation and they have been shown to have differential expression levels [62]. We also identify cell adhesion in this enrichment analysis. Cell adhesion has been reported to be disrupted in autism [63] and schizophrenia [64]. Moreover, in schizophrenia and bipolar disorders, cell adhesion pathways have been reported to contribute to disease susceptibility [65].

Comparison of enrichment analysis with and without TAD-based filtering

We compare the enrichment analyses with and without TAD-based filtering to understand the effect of TAD filtering. When the enrichment analysis is based on only neighborhood genes, we detect 48 enriched biological processes. When we repeat our analysis with TAD filtering, we observe 29 enriched biological processes. The top 10 enriched terms are mostly the same for both analysis; these include cell cycle and cell adhesion-related terms, as well as several cardiac and vascular-related functional terms (positive regulation of cardiac muscle cell proliferation, positive regulation of blood vessel endothelial cell migration and angiogenesis).

Running the enrichment analysis with TAD filtering allows us to uncover brain disorder-related GO-terms that are not identified by the enrichment analysis solely based on neighborhood genes. Using TAD filtering, we distinguish 6 enriched biological processes that have been reported to relate to a brain disorder in the literature, Table 3. Krishnan et al. [66] report that Regulation of Rho protein signal transduction (GO:0035023), somatic stem cell population maintenance (GO:0035019), calcium ion transport (GO:0006816), ubiquitin-dependent protein catabolic process (GO:0006511) are potential ASD related GO-terms. Rho GTPases are important regulators of the neural system, and mutations in Rho GTPases’ regulators and effectors can cause neural diseases, including ALS [67]. Moreover, we also observe positive regulation of phosphatidylinositol 3-kinase signaling (GO:0014068) after TAD-filtering. Phosphoinositide 3-kinase is a well-known pathway that regulates several processes, including proliferation, growth, apoptosis, and cytoskeletal rearrangement [68]. It is linked with several diseases and it is considered as a hallmark of cancer [69]. Kurek et al. [70] show that cancer-associated PIK3CA mutations cause epilepsy, and there is a strong correlation between epilepsy and autism [71]. Moreover, Krishnan et al. [66] report that this GO-term is related to ASD. Detecting neural disease linked GO-terms by employing TAD filtering shows that TAD filtering might help arrive at more precise enrichment results. In conclusion, using different approaches can lead to more nuanced enrichment analysis results.

Pathway enrichment using predefined pathway gene sets

NoRCE enables pathway enrichment analysis for various sources, including KEGG, Reactome pathway, and WikiPathways. Also, NoRCE supports pathway enrichment using custom pathway databases such as MSigDb [53], or other user-curated data provided in GMT format. To showcase the NoRCE capability, we utilize the Bader Lab dataset as the user-defined pathway gene set analysis. In this analysis, we only consider the genes in the neighborhood of the differentially expressed ncRNAs in brain disorders.

Interestingly, we find many enriched pathways that are related to neural diseases. Some of these pathways directly related to ASD, schizophrenia, and bipolar disorder, including Synaptic signaling pathway associated with an autism spectrum disorder, WP4539; Amyotrophic lateral sclerosis, WP2447; Alzheimer’s disease, WP2059. Additionally, we found that many of the signaling pathways, such as G-Protein Signaling, mTOR signalling, MAPK Signaling pathway and those pathways are associated with at least one of the brain disorder: autism spectrum disorder, schizophrenia, and bipolar disorder [72,73,74]. Due to the space limit, we list a subset of enriched disease-related pathways in Additional file 1: Table S6 and S7.

Comparison between ASD associated GO-terms and NoRCE enrichment results

Krishnan et al. [66] predict novel ASD risk genes based on the brain-specific functional network. In their work, they also identified functions potentially dysregulated by ASD-associated mutations. We compare our findings with this set of ASD associated GO function terms [66]. We observe that most of the enriched GO terms reported in NoRCE are listed as potential ASD-related GO terms in Krishnan et al. [66] study. The ncRNA enrichment analysis of NoRCE without TAD filtering identifies 48 enriched GO terms, and 32 of these terms are also in the list of ASD related GO terms [66], corresponding to 67% overlap. When TAD filtering is applied, there are 29 enriched terms, and 21 are in the ASD GO term list, corresponding to 73% overlap (Additional file 1: Table S9). We test the significance of these overlap ratios. We randomly select ncRNA set with the same size of the input from the all gene population. We find the enriched term with this random gene set and checked if the overlap ratio is equal or higher in the randomized case. A p-value is calculated by repeating this procedure 1000 times.

Case study 2: functional enrichment analysis of variably expressed miRNAs in brain cancer using miRNA targets

NoRCE offers a filtering option for the input miRNA’s targets. Users can choose to filter ncRNA neighbors, such that only those that are the targets of the miRNA are included. To demonstrate this option, we use a set of miRNAs that are differentially expressed ncRNAs in brain cancer obtained from dbDEMC 2.0 [75] for the functional analysis. This set contains 407 miRNAs and is provided in NoRCE with the name brain_miRNA. We choose the Reactome pathway as the functional gene set.

We identify lysosome vesicle biogenesis (p-value = 7.1e−05), trans-golgi network vesicle budding (p-value = 0.0006), ion channel transport (p-value = 0.0091), and axon guidance (p-value = 0.0382) pathways as enriched. Previous studies report that the axon guidance and ion channel transport pathways are related to the Glioblastoma Multiforme [76, 77]. Other evidence also suggests that miRNAs could be acting as key fine-tuning regulatory elements in axon guidance [78].

Case study 3: functional enrichment analysis with co-expression analysis

NoRCE also supports filtering based on a co-expression analysis. When defining coding gene neighborhoods for an ncRNA, the user can choose to include a coding gene only if it is co-expressed. Alternatively, the users can choose to augment the coding genes list with the co-expressed coding gene set. To demonstrate this option, we use NoRCE on the brain cancer patient data obtained from TCGA.

The TCGA data include expression levels for mRNA and miRNA for matched primary tumor solid samples from 527 tumor patients. miRNA-seq data are measured as per million mapped reads (RPM) values, and RNA-seq data are measured as Fragments per Kilobase of transcript per Million mapped reads upper quartile normalization (FPKM–UQ). We apply the same pre-processing step as in our previous method [79]. Genes and miRNAs that have very low expression levels (RPKM \(<0.05\)) in many patients (more than 20% of the samples) are filtered out. The gene expression values are log2 transformed, and those with high variability are retained for co-expression analysis. For this aim, only the genes with median absolute deviation (MAD) above 0.5 are used. The final expression dataset contains 444 miRNA and 12,643 mRNA genes on 527 tumor patients on which we perform Pearson correlation analysis. The mRNAs which have more than 0.1 correlation with a miRNA are retained.

When we examine the enriched pathways, Signaling by Receptor Tyrosine Kinases emerges as an important pathway. Receptor Tyrosine Kinases is a cell surface receptor family, and its members are responsible for growth factors, hormones, cytokines, neurotrophic factors [80]. Following their activation, they can signal through downstream pathways responsible for survival, differentiation, and angiogenesis [80]. Inhibition on Receptor Tyrosine Kinases and their signal pathways are utilized as target therapy on brain cancer [81]. Also, miRNAs are reported to take a role as mediators or suppressors in these pathways and promote tumor cell death [81]. The Reactome diagram for this pathway is illustrated in Additional file 1: Figure S4. In both target-based and co-expression-based analyses on the differentially expressed miRNAs, Axon guidance pathway is enriched. MiRNAs’ role in axon guidance have been reported elsewhere [78, 82,83,84]. For example, Baudet et al. [84] report that miR-124 controls Sema3A, which is essential for normal axon guidance. Accumulating evidence also points out that axonogenesis is stimulated by malignant cells and contributes to cancer growth and metastasis [85]. These findings also support the NoRCE capability for finding interesting functional inferences. The details of the enrichment results are provided in Table 2.

Table 2 Pathway enrichment results for nearby co-expressed genes with miRNAs
Table 3 Brain disorder related biological process GO term enrichment results that show the TAD analysis enhancement

Case study 4: functional enrichment analysis of pan-cancer driver lncRNAs

As a fourth case study, we conduct an analysis where the input list comprises lncRNAs. High-throughput sequencing technologies have revealed that there are thousands of lncRNAs whose aberrant expressions are associated with different cancer types [86]. As part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, the Cancer LncRNA Census provides a dataset of 122 high-confidence lncRNAs with causal roles in cancer phenotypes [87]. We utilize these known tumor suppressor or oncogene lncRNAs and conduct enrichment analysis with NoRCE. For this enrichment analysis, we use the neighborhood genes filtered based on TAD boundaries obtained from 3D Genome Browser [88]. All other choices and parameters are set to their default values.

Enrichment analysis of this set of 122 cancer associated lncRNAs yields 11 enriched biological processes (Additional file 1: Table S11). Interestingly, 2 of the 11 enrichment biological processes are related to miRNA processes. This could indicate that some of the cancer related lncRNAs are located in the same TAD regions as the miRNA host genes. Moreover, we detect developmental processes such as anatomical structure development, multicellular organism development, anterior/posterior pattern specification. This may indicate that these cancer related lncRNAs have a role in developmental processes. Also, analysis on the network for the enriched GO-term and their annotated genes, Additional file 1: Figure S5, demonstrate that the RNA process and their annotated genes form a separate graph from other enriched terms and genes.

We repeat the same analysis without considering the neighborhood gene information. In this case, we only consider the coding genes that partially overlap with the input lncRNA set and fall into the same TAD boundary with the lncRNA genes. This way we are able to measure the effects of including genes nearby on the genomic sequence for enrichment analysis. We detect 8 enriched biological processes. When we compare our findings with results obtained for enrichment analysis based on neighborhood genes filtered with TAD boundaries, we are unable to detect two developmental process (anterior/posterior pattern specification and anatomical structure morphogenesis and one miRNA related GO-term (miRNA mediated inhibition of translation. This finding is a subset of results that are obtained by cis-based gene enrichment filtering with TAD boundaries. Thus, we recommend carrying out enrichment analysis by combining multiple information sources such as cis genes, TAD boundaries, co-expression analyses.

Discussion

In showcasing NoRCE, we analyzed sets of ncRNAs implicated in diseases, including brain disorder related ncRNAs and cancer-related lncRNAs and miRNAs. Functional enrichment of these ncRNAs yielded interesting biological findings highlighting how NoRCE could be useful in answering a wide range of questions. The datasets and examples showcased here are also provided in the R\(\setminus\)Bioconductor package.

NoRCE uses functional sets such as those derived from GO and pathway databases and miRNA prediction tools. Improvements in these databases and tools allow NoRCE conduct more accurate analysis. NoRCE is designed for non-coding RNAs, but can use both coding and non-coding RNAs as input. Currently, the user can use NoRCE to conduct analysis in human and mouse, rat, zebrafish, fruit fly, worm, and yeast. As a future direction, NoRCE can be extended to support analysis for other species. Moreover, the current version of the package contains only miRNA target predictions. However, NoRCE can be enhanced by including target prediction for other ncRNAs, including sRNAs and snoRNAs.

Conclusions

NoRCE is a comprehensive, flexible, and user-friendly tool for enrichment analysis of all types of ncRNAs. It works for multi-species and is available as an R package. NoRCE, unlike existing tools, conducts enrichment by taking into account the genomic neighborhood of the ncRNAs in the input set and transfers functional annotations of these coding genes. We should note that although cis-regulation has been reported for many ncRNA types, it may not hold for all types of ncRNAs. Therefore, in addition to the genomic neighborhood-based analysis, NoRCE allows the standard approaches of using coding genes co-expressed with the input ncRNAs in detecting the enriched functions. Another unique feature of NoRCE that it allows an option for making use of TAD regions. NoRCE provides flexibility to the user; the user can perform analysis with different options and use the library’s readily available datasets to conduct the analysis or input custom datasets. It is also possible to include or exclude any analysis that NoRCE contains.

Availability and requirements