Background

RNA sequencing continues to grow in popularity as an investigative tool for biologists. A vast variety of RNA sequencing analysis methods allow researchers to compare gene expression levels between different biological specimens or experimental conditions, cluster genes based on their expression patterns, and characterize expression changes in genes involved in specific biological functions and pathways.

Specific tools exist to perform the tasks described above (see the Discussion section and Additional File 1: Table S1 for a detailed comparison of available tools). However, most analysis tools can only perform a subset of these tasks. Any out-of-the-ordinary research questions require researchers to write customized analysis scripts, which may not be easy to share or replicate. Moreover, many of the existing tools require users to be familiar with reading and writing code, making them usable only by researchers experienced in computer programming.

RNAlysis offers a solution to these problems by (1) using a modular approach, allowing users to either analyze their data step-by-step, or construct reproducible analysis pipelines from individual functions; and (2) providing an intuitive and flexible graphical user interface (GUI), allowing users to answer a wide variety of biological questions, whether they are general or highly specific, and explore their data interactively without writing a single line of code. RNAlysis includes thorough documentation and step-by-step guided analyses, to help new users to learn the software quickly and acquire good data analysis practices (available online or as Additional File 2: Tutorial).

Implementation

RNAlysis was designed to perform three major tasks: (1) pre-processing and exploratory data analysis; (2) finding gene sets of interest through filtering, clustering, and set operations; (3) visualizing intersections between gene sets and performing enrichment analysis on those sets (Fig. 1).

Fig. 1

The workflow of RNAlysis. Top section: a typical analysis with RNAlysis can start at any stage from raw/trimmed FASTQ files, through more processed data tables such as count matrices, differential expression tables, or any form of tabular data. Middle section: data tables can be filtered, normalized, and transformed with a wide variety of functions, allowing users to clean up their data, fine-tune their analysis to their biological questions, or prepare the data for downstream analysis. RNAlysis also provides users with a broad assortment of customizable clustering methods to help recognize genes with similar expression patterns, and visualization methods to aid in data exploration. All of these functions can be arranged into customized Pipelines that can be applied to multiple tables in one click, or exported and shared with others. Bottom section: Once users have focused their data tables into gene sets of interest, or imported such gene sets from another source, they can use RNAlysis to visualize the intersections between different gene sets, extract lists of genes from any set operations applied to their gene sets and data tables, and perform enrichment analysis for their gene sets, using either public datasets such as GO and KEGG or customized, user-defined enrichment attributes

Input

RNAlysis can interface with existing tools, such as CutAdapt, kallisto, bowtie2, featureCounts, limma, and DESeq2 [1,2,3,4,5,6,7,8], to enable users to run basic adapter-trimming, RNA sequencing quantification, read alignment, feature counting, and differential expression analysis through a graphical user interface. Users can therefore begin their analysis in RNAlysis with sequencing data at any stage. Alternatively, users can load data tables that were generated elsewhere into RNAlysis. RNAlysis has a tabbed interface, which allows users to examine and analyze multiple data tables in parallel, seamlessly switching between them.

RNAlysis can accept data from any organism. It can analyze gene expression matrices (raw or normalized), differential expression tables, or user-defined gene sets of interest, and it accepts annotations for user-defined attributes of genes. Since RNAlysis works with tabular data, it is applicable to any type of data table.

Data validation and pre-processing

First, RNAlysis allows users to validate their data by summarizing and visualizing its patterns and distribution. For instance, users may compare the distribution of gene expression between samples through scatter plots and pair plots, or examine general trends in the data, as well as potential batch effects, via clustergram plots and PCA projections.

Moreover, RNAlysis allows users to pre-process their data by normalizing it with one of several methods (such as median of ratios, relative log ratio, trimmed mean of M-values, and more) [3, 9,10,11,12], filtering out lowly-expressed genes, and eliminating rows with missing data from their tables.
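To make the normalization step concrete, the following is a minimal sketch of median-of-ratios normalization in its commonly described form, written in plain Python. It is not the RNAlysis implementation; the function name, the input layout (a dictionary of per-sample count lists), and the example counts are assumptions made for illustration only.

```python
import math
from statistics import median

def median_of_ratios(counts):
    """Hypothetical helper: counts maps sample name -> list of raw counts,
    with genes in the same order for every sample.
    Returns (size_factors, normalized_counts)."""
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log of the per-gene geometric mean across samples.
    # Genes with a zero count in any sample are skipped (marked None).
    log_geomeans = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geomeans.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geomeans.append(None)

    # Size factor of a sample = median ratio of its counts to the geometric means.
    size_factors = {}
    for s in samples:
        log_ratios = [math.log(counts[s][g]) - log_geomeans[g]
                      for g in range(n_genes) if log_geomeans[g] is not None]
        size_factors[s] = math.exp(median(log_ratios))

    normalized = {s: [c / size_factors[s] for c in counts[s]] for s in samples}
    return size_factors, normalized

# Made-up example: sample "b" was sequenced twice as deeply as sample "a".
example_counts = {"a": [10, 20, 30, 0], "b": [20, 40, 60, 5]}
size_factors, normalized = median_of_ratios(example_counts)
```

In this toy example the two samples differ only in sequencing depth, so after normalization the genes that are expressed in both samples receive equal values.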

Data filtering and clustering

After data pre-processing, users can filter their data tables according to a broad array of parameters, depending on the nature of their data and biological questions. These filtering functions can be applied in particular orders and combinations to suit the user’s specific needs. They include, among many others, filtering by statistical significance and/or the direction and magnitude of fold change, filtering genomic features by their type, and performing set operations between different data tables and gene sets (for instance, intersections, differences, and majority-vote intersections).
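As an illustration of filtering by significance and by the direction and magnitude of fold change, here is a small sketch in plain Python. It does not use the RNAlysis API. The function name and the example rows are hypothetical; the column names `log2FoldChange` and `padj` follow the DESeq2 output convention, and the default significance threshold of 0.05 is an assumption.

```python
def filter_differential(rows, alpha=0.05, min_abs_log2fc=1.0, direction="both"):
    """Hypothetical helper: keep rows that are statistically significant
    (padj <= alpha) and whose fold change is large enough in the requested direction."""
    if direction not in ("up", "down", "both"):
        raise ValueError("direction must be 'up', 'down', or 'both'")
    kept = []
    for row in rows:
        padj, lfc = row["padj"], row["log2FoldChange"]
        if padj is None or padj > alpha:
            continue  # missing or non-significant adjusted p-value
        if abs(lfc) < min_abs_log2fc:
            continue  # fold change too small
        passes_direction = (
            direction == "both"
            or (direction == "up" and lfc > 0)
            or (direction == "down" and lfc < 0)
        )
        if passes_direction:
            kept.append(row)
    return kept

# Made-up differential expression table.
table = [
    {"gene": "geneA", "log2FoldChange": 2.1, "padj": 0.001},
    {"gene": "geneB", "log2FoldChange": -1.7, "padj": 0.02},
    {"gene": "geneC", "log2FoldChange": 0.4, "padj": 0.0001},
    {"gene": "geneD", "log2FoldChange": 3.0, "padj": 0.3},
    {"gene": "geneE", "log2FoldChange": -2.5, "padj": None},
]
upregulated = [r["gene"] for r in filter_differential(table, direction="up")]
downregulated = [r["gene"] for r in filter_differential(table, direction="down")]
```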

One of the powerful features of RNAlysis is the ability to easily extract gene lists from set operations applied to the user’s tables and gene sets, and use these lists in downstream analyses. Users can do this by applying a pre-defined set operation (like intersection or difference) or by hand-picking subsets of interest through an interactive graphical platform.
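The set operations mentioned above map directly onto ordinary set arithmetic. The sketch below uses built-in Python sets rather than RNAlysis itself; the gene identifiers are made up, and the reading of a majority-vote intersection as "genes present in at least a given fraction of the sets" is an assumption of this example.

```python
from collections import Counter

def majority_vote_intersection(gene_sets, threshold=0.5):
    """Hypothetical helper: genes that appear in at least `threshold`
    (a fraction between 0 and 1) of the supplied gene sets."""
    appearances = Counter(gene for gene_set in gene_sets for gene in set(gene_set))
    needed = threshold * len(gene_sets)
    return {gene for gene, n in appearances.items() if n >= needed}

# Made-up gene sets, e.g. genes downregulated under three conditions.
heat = {"geneA", "geneB", "geneC"}
osmotic = {"geneB", "geneC", "geneD"}
starvation = {"geneC", "geneD", "geneE"}

in_all = heat & osmotic & starvation      # intersection
only_heat = heat - osmotic - starvation   # difference
in_any = heat | osmotic | starvation      # union
in_most = majority_vote_intersection([heat, osmotic, starvation], threshold=0.5)
```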

Finally, RNAlysis allows users to cluster genes based on the similarity of their expression patterns. RNAlysis supports an extensive selection of clustering algorithms, including distance-based clustering (K-Means, K-Medoids, Hierarchical clustering), density-based clustering (HDBSCAN) [13], and ensemble-based clustering (a modified version of the CLICOM algorithm) [14].

Moreover, RNAlysis provides users with a wide array of distance metrics for clustering analysis. This includes the implementation of distance metrics that were specially developed for biological applications, such as time-course gene expression data [15], and distance metrics that were empirically found to best suit transcriptomics analysis [16].

Modularity and building customized pipelines

Filtered data tables can be saved or loaded at any stage during the analysis. The operations performed on the data and their order will be automatically reflected in the output files' names. Additionally, any operation applied to the data can be undone with a single click, and RNAlysis displays the history of commands applied to each table in the order they were applied.

As mentioned earlier, users can “bundle” any of the functions RNAlysis offers into distinct Pipelines, which can then be applied in the same order and with the same parameters to any number of similar data tables. This helps users to save time and avoid mistakes and inconsistencies when analyzing a large number of datasets. Pipelines can also be exported and shared with other researchers, who can then use these Pipelines on any machine on which RNAlysis is installed. This feature makes analysis pipelines easier to report and share, increasing the reproducibility and transparency of bioinformatic results.
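The idea behind a Pipeline, an ordered list of functions with fixed parameters that is replayed on many tables, can be sketched in a few lines of plain Python. The class and function names below (`SimplePipeline`, `drop_missing`, `filter_low_expression`) are hypothetical and are not the RNAlysis API, and the JSON export format is an assumption chosen only to show that the recorded steps can be serialized and shared.

```python
import json

def drop_missing(table):
    """Remove rows in which any value is missing (None)."""
    return [row for row in table if all(v is not None for v in row.values())]

def filter_low_expression(table, column, threshold):
    """Keep rows whose value in `column` is at least `threshold`."""
    return [row for row in table if row[column] >= threshold]

class SimplePipeline:
    """Hypothetical sketch: stores steps in order and replays them on any table."""

    def __init__(self):
        self.steps = []  # ordered list of (function, parameters)

    def add_step(self, func, **params):
        self.steps.append((func, params))
        return self  # allow chaining

    def apply_to(self, table):
        for func, params in self.steps:
            table = func(table, **params)
        return table

    def export(self):
        """Serialize step names and parameters so the recipe can be shared."""
        return json.dumps([{"function": f.__name__, "params": p} for f, p in self.steps])

pipeline = SimplePipeline()
pipeline.add_step(drop_missing).add_step(filter_low_expression, column="count", threshold=10)

# The same steps, in the same order and with the same parameters, on two made-up tables.
table_a = [{"gene": "geneA", "count": 50}, {"gene": "geneB", "count": None}, {"gene": "geneC", "count": 3}]
table_b = [{"gene": "geneA", "count": 8}, {"gene": "geneD", "count": 120}]
results = [pipeline.apply_to(t) for t in (table_a, table_b)]
```

Because the steps are stored rather than executed immediately, the order of operations is fixed once and cannot drift between datasets.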

Enrichment analysis

After applying the analyses mentioned above to summarize data tables down to gene sets of interest, users can carry out enrichment analysis for those gene sets. Gene set enrichment analysis is a collection of methods for identifying classes of genes, biological processes, or pathways that are over- or under-represented in a gene set of interest [17]. Enrichment analysis is highly prevalent in RNA sequencing analysis since it allows researchers to associate a differentially-expressed gene set with underlying biological functions [18, 19].

RNAlysis supports multiple approaches and statistical methods for enrichment analysis, including classic gene set enrichment analysis, permutation tests [20], background-free enrichment analysis [21, 22], and enrichment for ordinal or continuous variables.

RNAlysis can automatically retrieve enrichment analysis annotations of all major model organisms from widely accepted databases such as Gene Ontology categories and KEGG pathways [23, 24]. However, unlike many other analysis pipelines, RNAlysis also accepts annotations for user-defined attributes and groups (see Additional File 1: Table S1). This allows users to tailor their analyses to their specific needs and biological questions.

Documentation and accessibility

While RNAlysis can be operated entirely within a graphical interface, all the functions and features RNAlysis offers can also be imported and used in standard Python scripts, allowing users with coding experience to further automate and customize their bioinformatic analyses. Pipelines that are generated with the graphical interface can be imported and used through the programmatic interface and vice versa, maximizing the reproducibility and shareability of analyses performed with RNAlysis.

Moreover, whenever RNAlysis integrates with external tools and packages for tasks such as alignment or differential expression, RNAlysis will export the exact command line or R script used to produce the results, allowing all users to share the full details of their analysis.

RNAlysis includes extensive documentation to guide new and returning users. A user guide and a tutorial offer a bird’s eye view of the modules and features of RNAlysis, along with video demonstrations, usage examples, and recommended practices. A frequently asked questions page offers solutions to common questions and issues posed by users of RNAlysis. The remainder of the documentation provides a complete reference of the functions and features available in RNAlysis. Users can look up specific entries for a more thorough review of their theoretical background, use cases, and optional parameters.

RNAlysis can either be installed as a Python package on all standard operating systems, or downloaded as a stand-alone application, which does not require users to install any mandatory dependencies. This simplifies installation and lowers the barrier to entry for less computationally-oriented users.

The project is available as an open-source, public GitHub repository. Numerous test cases are also provided within the package, which are executed automatically every time the source code is updated, ensuring that data analysis with RNAlysis remains consistent and reliable.

RNAlysis is powered by various open-source projects [15, 25,26,27,28,29,30,31,32,33,34] which are installed automatically and used when needed.

Results

We examined the ability of RNAlysis to facilitate analyses of multiple different publicly available datasets [35,36,37,38]. First, we analyzed time-series gene expression data, using clustering analyses to group genes based on their expression pattern. Then, we demonstrated the analysis of multiple RNA sequencing datasets from raw FASTQ files, showing the applications of analysis Pipelines and set operations between datasets. A step-by-step tutorial of these analyses is available in the online RNAlysis documentation (also available as Additional File 2: Tutorial).

Analysis #1: Exploring gene expression patterns across the development of Caenorhabditis elegans

In the first analysis, we examined a dataset describing average gene expression under different developmental stages of Caenorhabditis elegans nematodes, derived from the control samples of many publicly available RNA sequencing experiments [35].

Exploratory data analysis revealed that samples from contiguous developmental stages show the highest correlations (Fig. 2A), and PCA uncovered a semi-circular pattern, with over 75% of the data’s variance explained by the first two principal components (Fig. 2B). Interestingly, the first principal component arranges the samples by their relative germline content, with embryos and adult nematodes on one end, L1–L3 larvae on the other, and L4 larvae in between. The second principal component arranges the samples by their developmental stage, with embryos at the top of the graph and adults at the bottom.

Fig. 2

Exploratory data analysis reveals patterns in time-series gene expression data. A Principal component analysis projection of the time-series data. Depicted are the first two principal components, explaining > 75% of the variance in the data. Data was power-transformed and standardized before the analysis. B Pair-plot, depicting the pairwise Spearman correlation between each pair of samples, and a histogram of normalized gene expression in each sample. Each dot represents the log of normalized expression of a single gene

Next, we extracted clustering results at three different resolutions by using exemplars from three different classes of clustering algorithms: a distance-based algorithm (K-Medoids) (Additional File 3: Figure S1), a density-based algorithm (HDBSCAN) (Additional File 4: Figure S2), and an ensemble-based algorithm (CLICOM) (Fig. 3). While one of the most challenging aspects of RNA sequencing clustering analysis is the requirement to specify the number of clusters in advance, RNAlysis provides unbiased clustering methods that can either estimate a good number of clusters to detect, or require no such input at all and instead ask the user to specify the smallest cluster size that would be of interest.

Fig. 3

Clustering analysis of time-series gene expression data. A Clustering analysis of the data using modified CLICOM clustering, using five underlying clustering setups, an evidence threshold of 50%, and a minimal cluster size of 75 [14]. Clusters are sorted by their size. Each graph depicts the power-transformed and standardized expression of all genes in the cluster, with the center lines denoting the clusters’ means and standard deviations across developmental stages of C. elegans nematodes. B PCA projection of the power-transformed and standardized gene expression data. Each dot represents a gene. The points are colored according to the cluster they belong to in the CLICOM clustering result depicted in A

While the data examined here contains only a single entry for each experimental condition, RNAlysis is well suited for clustering analysis of replicate data, since it can cluster each batch of replicates separately and combine the results of those batches, yielding more accurate and robust clustering results [39].

Finally, we plotted the expression level of specific genes of interest under the different developmental stages (Fig. 4A) and performed GO enrichment analysis on one of the clusters we previously detected, revealing a robust enrichment for neuropeptide signaling pathways (Fig. 4B).

Fig. 4

Gene expression plots and enrichment analysis of time-series gene expression data. A Normalized gene expression values of the time-series data for the two sample genes oma-1 (WBGene00003664) and skn-1 (WBGene00004804). B GO enrichment ontology graph, depicting enrichment results for cluster #9 (genes whose expression decreases along development from L1s to adults) (see Fig. 3). The graph depicts the hierarchical relationship between the GO terms. Each GO term was colored according to its log2 (Fold Enrichment) score if it was statistically significant (q-value ≤ 0.05)

Analysis #2: Measuring the effect of stress on the expression of small RNA factors

In the second analysis, we analyzed three datasets that examined the effects of three different stress conditions (osmotic stress, heat shock, and starvation) on gene expression [36,37,38]. This is a replication of our previously published analysis [40] done with an earlier version of RNAlysis (version 1.3.5, 2019), where the purpose was to examine the effects of stress exposure on the expression of small RNA factors. This analysis shows how RNAlysis facilitates answering highly specific biological questions in an intuitive manner.

We started the analysis with raw FASTQ files, applying adapter trimming, pseudo-alignment, transcript expression quantification, and differential expression analysis to the three datasets, all executed through the RNAlysis graphical interface.

Next, we examined the distribution of differentially expressed genes under each condition with a Volcano Plot (Fig. 5A) and extracted from each differential expression table the lists of significantly up-regulated and down-regulated genes. This step was automated by building and applying a Pipeline, allowing us to analyze all three tables in the exact same manner with the click of a button.

Fig. 5

Analysis of stress-induced gene expression changes. A Volcano plot depicting differential expression results, comparing worms that experienced starvation (str) to worms that grew under normal conditions. Each dot represents a gene. Differentially expressed genes with log2 (fold change) ≥ 1 were painted in red, and differentially expressed genes with log2 (fold change) ≤ −1 were painted in blue. B A proportional Venn Diagram depicting the intersections between genes that are significantly downregulated under heat shock, osmotic stress, or starvation, compared to their matching control samples. C Log2 (Fold enrichment) score for a curated list of epigenetic genes, in the set of genes significantly downregulated under all stress conditions. The p-value for enrichment was calculated using 10,000 random gene sets identical in size to the tested group. *** indicates adj. p-value ≤ 0.001

Following these filtering steps, we examined the intersection of the up- and down-regulated genes between the different stress conditions (Fig. 5B) and extracted the lists of genes that are significantly up- or down-regulated under all stress conditions. We then created an appropriate background set for enrichment analysis by taking the union of all genes that are sufficiently expressed under at least one stress condition.

Finally, we ran an enrichment analysis on the stress-downregulated genes, measuring whether they are significantly enriched for a user-defined list of epigenetic-related genes (Fig. 5C). We found the stress-downregulated genes to be significantly enriched for epigenetic-related genes, as previously shown [40]. Enrichment for user-defined attributes is a feature unique to RNAlysis, allowing users to answer highly specific biological questions. This means that the users are not limited to widely available datasets, but can directly analyze any gene sets and attributes of interest without the need to write any code.

Discussion

RNAlysis offers researchers a robust, scalable, and easy-to-use tool to analyze RNA sequencing data. RNAlysis was designed not only to be intuitive and approachable for new users, but also to provide a high degree of efficiency, control, and robustness to experienced bioinformaticians.

Other useful software tools for the analysis of RNA sequencing data exist (see Additional File 1: Table S1). For example, Galaxy [41, 42] is a web-based scientific analysis platform for the analysis of biological data. Galaxy offers many shared features with RNAlysis, including integration of existing analysis tools, extensive documentation, shareable pipelines, and the ability to filter, sort, and intersect data tables. Galaxy has a large following [42] and supports a wide array of established tools from various fields of bioinformatics. Contrary to Galaxy, RNAlysis aims to simplify commonly used actions for RNA sequencing analysis in particular, such as filtering and set operations. This is done by providing users with dozens of ready-made filtering functions relevant to RNA sequencing data and supporting set operations on an arbitrary number of datasets with an intuitive, point-and-click interface. Moreover, RNAlysis offers analysis methods that are especially useful to RNA sequencing data, such as advanced clustering methods and enrichment analysis for user-defined attributes, which are not available on Galaxy.

Tools such as ARPIR [43] and NetSeekR [44] can take users all the way from the alignment of reads and differential expression analysis through GO enrichment and other tertiary analyses such as gene network analysis. Other tools like ideal [45], PIVOT [46], and DEBrowser [47] provide users with a graphical interface to perform differential expression analysis and enrichment analysis.

While these tools allow less experienced bioinformaticians to perform basic transcriptomic analysis, they are limited in their capability to filter datasets, perform set operations between datasets, use more advanced clustering algorithms, or automate and streamline data analysis with pipelines. In contrast to these tools, RNAlysis is highly modular and customizable, allowing users to tailor their investigations to their biological questions through advanced data filtering, intersecting multiple datasets, and a high degree of control over analysis parameters at every stage of the process. Moreover, RNAlysis can analyze RNA sequencing experiments from start to finish since it supports pre-processing, alignment, and quantification utilities of FASTQ files.

Conclusions

RNAlysis offers a modular toolbox for RNA sequencing data analysis, with the unique combination of an intuitive graphical interface and highly customizable analysis workflows, making it distinct from many other RNA sequencing analysis tools.

We believe that the ability to build customized and reproducible analysis pipelines, combined with the user-friendly interface, will allow researchers to gain novel biological insights from RNA sequencing data easily.

Methods

For a detailed description of the analysis pipelines used to generate the results displayed here, see Additional File 2: Tutorial.

Statistical analysis

GO enrichment analysis (Fig. 4) was performed as follows: GO annotations were retrieved through GOlr, the GO Solr search engine API. Annotations were filtered using the parameters described in Additional File 2: Tutorial: only annotations for C. elegans (taxon ID 6239) were kept, and “NOT” annotations were excluded. Annotations were propagated using the ELIM algorithm [48]. P-values were calculated using the hypergeometric test and corrected for multiple comparisons using the Benjamini-Yekutieli FDR method for general or negatively correlated tests [49].
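The two statistical steps named above, a hypergeometric test followed by Benjamini-Yekutieli correction, can be sketched in plain Python as follows. This is an illustrative sketch and not the RNAlysis code: the function names are hypothetical, the test is written as a one-sided test for over-representation (an assumption, since the text does not specify the tail), and annotation retrieval and propagation are not shown.

```python
import math

def hypergeom_enrichment_pvalue(population, annotated, sample, observed):
    """One-sided hypergeometric p-value P(X >= observed), where
    population = number of background genes,
    annotated  = background genes carrying the annotation,
    sample     = size of the tested gene set,
    observed   = annotated genes found in the tested gene set."""
    upper = min(sample, annotated)
    total = math.comb(population, sample)
    tail = sum(
        math.comb(annotated, i) * math.comb(population - annotated, sample - i)
        for i in range(observed, upper + 1)
    )
    return tail / total

def benjamini_yekutieli(pvalues):
    """Benjamini-Yekutieli adjusted p-values, returned in the input order."""
    m = len(pvalues)
    harmonic = sum(1.0 / i for i in range(1, m + 1))  # c(m) = 1 + 1/2 + ... + 1/m
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity and capping at 1.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m * harmonic / rank)
        adjusted[idx] = running_min
    return adjusted

# Made-up numbers: 10 background genes, 4 annotated, a set of 3 genes with 2 annotated.
p_single = hypergeom_enrichment_pvalue(population=10, annotated=4, sample=3, observed=2)
q_values = benjamini_yekutieli([0.01, 0.04, 0.03])
```

The extra harmonic factor is what distinguishes this correction from the Benjamini-Hochberg procedure and makes it valid under arbitrary dependence between tests.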

Enrichment for user-defined attributes (Fig. 5) was performed as follows: Annotations were generated as described in Additional File 2: Tutorial. The p-value was estimated with a permutation test, using 10,000 random gene sets identical in size to the examined set of differentially expressed genes, according to the formula p = (successes + 1)/(repeats + 1), where successes are defined as randomized gene sets with a proportion of positively-annotated genes equal to or larger than the proportion in the test set. This formula yields a positively-biased estimator of the real p-value (a conservative estimate of the p-value).
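The permutation test described above can be written out as a short sketch. The formula p = (successes + 1)/(repeats + 1) and the definition of a success are taken from the text; everything else, including the function name, drawing the random sets from the background without replacement, the fixed random seed, and the toy gene identifiers, is an assumption made for illustration.

```python
import random

def permutation_enrichment_pvalue(gene_set, background, annotated, repeats=10_000, seed=0):
    """Hypothetical helper: estimate an enrichment p-value by comparing the
    proportion of annotated genes in `gene_set` with the proportions found in
    `repeats` random gene sets of the same size drawn from `background`."""
    rng = random.Random(seed)         # fixed seed so the sketch is reproducible
    background = sorted(background)   # random.sample needs a sequence
    set_size = len(gene_set)
    observed = sum(g in annotated for g in gene_set) / set_size

    successes = 0
    for _ in range(repeats):
        random_set = rng.sample(background, set_size)  # without replacement
        proportion = sum(g in annotated for g in random_set) / set_size
        if proportion >= observed:
            successes += 1

    # The added 1 in numerator and denominator keeps the estimate above zero
    # and makes it conservative.
    return (successes + 1) / (repeats + 1)

# Made-up data: 100 background genes, of which the first 10 carry the attribute.
background = {f"gene{i}" for i in range(100)}
annotated = {f"gene{i}" for i in range(10)}
enriched_set = {f"gene{i}" for i in range(5)}         # every gene is annotated
unenriched_set = {f"gene{i}" for i in range(50, 55)}  # no gene is annotated

p_enriched = permutation_enrichment_pvalue(enriched_set, background, annotated, repeats=1000)
p_unenriched = permutation_enrichment_pvalue(unenriched_set, background, annotated, repeats=1000)
```

With this estimator the smallest attainable p-value is 1/(repeats + 1), so the number of repeats sets the resolution of the test.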

Availability and requirements

Project name: RNAlysis

Project home page: https://github.com/GuyTeichman/RNAlysis

Operating system(s): Platform independent

Programming language: Python 3

Other requirements: Python 3.7.9 or higher (optional), GraphViz 3.0 or higher (optional), kallisto 0.44.0 or higher (optional), bowtie2 2.3.5 or higher (optional), R 4.0.0 or higher (optional), Microsoft C++ Build Tools 14.0 or higher (optional, on Windows computers only), Perl 5.9 or higher (optional)

License: MIT

Any restrictions to use by non-academics: none