Background

In recent years, a growing number of high-throughput genomic technologies have become available for biomedical research and are jointly providing high-resolution genomic data that fuel the revolution of personalized medicine [1, 2]. These technologies (collectively named omics) allow the simultaneous quantification of a large number of features at various biological levels. The features include gene expression (mRNA and miRNA abundance levels measured by microarrays or RNA-Seq), protein expression (measured by mass spectroscopy or reverse-phase protein arrays), DNA methylation (methylation arrays), copy number variation (SNP arrays), and others [3, 4]. The technologies vary broadly in the number of features they measure as well as in the distribution of measured values [5]. However, they can typically be summarized as a numeric matrix where columns represent samples and rows represent biological features (often correlating to genes). Bioinformatic analysis of such genomic matrices has been extensively used for identifying biologically distinct sample groups, and for revealing groups of correlated biological features [6, 7].

The number of tumor samples and measured features that are included in a typical cancer genomic dataset have grown dramatically in the last few years, owing to increasing resolution and reduced costs of array and sequencing technologies. Modern repositories comprise thousands of patient samples and many thousands of features. Investigation of such large datasets is computationally challenging as it requires robust software tools for supporting the analysis of both samples and features in high dimensional data [8].

In addition to genomic data, modern cancer datasets can include extensive medical information (labels) describing each sample, such as clinical properties or assignment to a predefined phenotype. These clinical labels make it possible to fuse genomic and clinical data in various ways in order to discover new insights based on feature-phenotype associations. Common clinical labels in cancer datasets include disease subtypes, pathological stages, survival and recurrence follow-up information, as well as response to treatment. Identification of genomic features that are correlated with significant clinical parameters (biomarkers) is expected to play a significant role in the field of personalized medicine, by which the status of multiple biomarkers may improve subtype diagnosis and guide therapeutic decisions [9, 10].

The Cancer Genome Atlas (TCGA) is an example of a revolutionary multi-label multi-omic genomic database [11]. It includes more than 11,000 samples from 33 types of cancer, where each sample was measured using multiple omic technologies and was described by dozens of clinical labels [12]. Many studies have already analyzed TCGA data, improving the subtyping of cancers and shedding light on the biological mechanisms underlying the development of various cancer types [13,14,15]. Such analyses are typically time-consuming, computationally challenging, and entail team effort, as they require applying a diverse array of methods, statistical tools, and algorithms, and often also require writing extensive computer code to perform and interweave the various steps of the analysis [16]. Hence, to effectively extract clinically meaningful insights from such multi-omic multi-label databases, specialized agile integrative tools are required.

To address this challenge, we developed PROMO (PROfiler of Multi Omic data), a fully interactive software suite capable of quickly importing, preprocessing, visualizing, analyzing and reporting the results on cancer datasets in a seamless fashion, without writing a single line of computer code. PROMO includes an extensive array of bioinformatic methods for performing major common analysis types including exploration, visualization, identification of clinically significant disease subtypes, revealing co-regulated feature groups, biomarker discovery, simple classification and integrative multi-omic analysis. Table 1 presents an overview of the fundamental analysis types available in PROMO.

Table 1 PROMO’s main analysis types

An early version of PROMO was developed as part of a study where we identified distinct prognostic subgroups in Luminal-A breast tumors based on expression and methylation data [17]. The analysis workflow in that project provides an example of the key steps in a typical application of PROMO (Fig. 1): Data are imported, filtered and preprocessed. Tumor samples are clustered into groups that are then assessed for clinical significance using survival analysis and statistical tests on the clinical labels. Clustering of the genes followed by gene enrichment analysis associates sample clusters with active gene functions. The analysis is summarized visually in a genomic matrix clearly showing the identified sample clusters and their association to important clinical labels (Fig. 1, step 4), in addition to downstream analysis methods (Fig. 1, steps 5–7).

Fig. 1
figure 1

PROMO’s subtype discovery workflow – From data import to subtype classifier. This figure outlines the complete workflow by which PROMO can be used for identifying and characterizing clinically distinct cancer subtypes: (1) Importing genomic data together with clinical information in one of several available formats. (2) Preprocessing the data and preparing it for downstream analysis. (3) Verifying the integrity of the data, characterizing its distribution and exploring dataset properties with respect to the available clinical labels. (4) Employing clustering algorithms partition both samples and features (genes) into groups. (5) Applying enrichment tests to identify clinically significant sample subtypes and groups of co-regulated genes. and to characterize their function. (6) Statistical tests identify features that distinguish between different sample subtypes as well as survival related features. (7) Decision tree classifiers can be generated for formulating a set of rules by which a new sample can be classified

In this paper, we describe PROMO’s main features and demonstrate its use in a study of a breast cancer cohort [14].

Results

We now describe PROMO’s main features, organized by analysis steps. The described features can be accessed using PROMO’s menus or graphical user interface (Fig. 2). The dataset used was TCGA’s breast cancer gene expression profiles (1218 samples downloaded from UCSC’s XENA website in May 2018). It is also available on the datasets page of PROMO’s website.

Fig. 2
figure 2

PROMO’s graphical user interface. a PROMO’s main screen. The genomic matrix is in the center with columns corresponding to samples and rows to features. Colors represent feature values according to the scale on the right. The colorful label bar beneath the matrix displays the currently selected sample label. Analysis steps are documented in the textbox on the bottom of the screen. Key commands are available on the tabbed panels on the left of the screen. b The Preprocessing panel allows filtering, normalization, and sorting of the genomic data. c Clustering the dataset’s samples and features using various algorithms and distance functions is available through the Clustering panel. Resulting clustering solutions are aggregated for future review and filtering. d The Analysis panel provides access to several visualization and exploratory tools like PCA, t-SNE, survival analysis, biomarker discovery, GO enrichment and automatic classifier generation. e In the Dataset Collection panel, several genomic matrices can be assembled into a multi-omic dataset collection, and then analyzed together

Data import and preprocessing

In all analysis types, the first steps are to import the required data from local files into PROMO and prepare it for the analysis. PROMO enables the integration of data of different types and from multiple sources by importing genomic matrices, sample labels, and sample or gene partition files. Genomic matrices accompanied by complementary phenotypic information (clinical labels) can be loaded in the following formats: tabular text files, Gene Expression Omnibus (GEO) [18] series files (including direct download from within PROMO), UCSC’s XENA [19, 20] file formats (available for many public datasets including all TCGA’s data), and PROMO’s DSC files. The latter are precompiled multi-omic datasets available at PROMO’s dataset download page for selected TCGA cohorts. PROMO also allows separate loading of additional clinical labels and sample partition files to be used in the subtype discovery workflow.

After import, the loaded dataset can be ‘cleaned’ by filtering out samples based on clinical label values, and also by removing certain features (e.g., removing low variability genes or keeping only specific genes). Additional available common preprocessing steps include flooring, ceiling, and row normalization.

Data exploration and visualization

Once a genomic matrix is loaded to PROMO, its properties can be explored with respect to any selected clinical label (Fig. 3a). The samples (columns) in the matrix can be reordered based on any clinical label or by their mean expression. Basic dataset properties like value distribution (Fig. 3b), clinical label distribution (Fig. 3c), and sample variation (Fig. 3d) can be studied and displayed graphically in various ways including PCA [21, 22] and t-SNE [23]. For ease of interpretation, all displays consistently use the same colors to represent the various sample subgroups.

Fig. 3
figure 3

Visualization of multi-label genomic data. PROMO provides a variety of methods for visualizing a genomic dataset together with its associated clinical information. a A multi-label expression matrix plot. The plot is composed of a heat-map representation of the genomic matrix and several label bars beneath it showing different clinical labels that the user interactively selected. The colors in each label bar show the label value of each sample according to the legend on the right. The label appears below the lower left corner of the bar. Here, breast cancer patient profiles were grouped according to their PAM50 category (shown in the top label bar). By observing the distribution of values in other bars, relations between the groups and the labels can be observed. For example, the ER, PR and HER2 status of most samples in the ‘Basal’ group are negative, while the HER2 status of most ‘HER2’ group is positive. b Data distribution and c Clinical label distribution can be explored and visualized separately, or in combination using plots such as d PCA and others. These figures show that the Basal tumor samples are mainly characterized by Negative ER, PR and HER2 labels (a) and markedly differ from all other subtypes in their gene expression pattern (d), in accordance with the literature [14]

Clustering and enrichment analyses

A major effort in promoting precision medicine is to identify disjoint groups of similar patients and characterize each group using its distinct genomic profile, survival data, and clinical information. To reveal the similarities among patients, clustering is often performed on both samples and features [24]. Clustering the samples can reveal patient groups corresponding to disease subtypes [25] while clustering the features reveals groups of co-regulated genes [26]. PROMO provides various clustering algorithms such as K-means [27], hierarchical clustering [28], and Click [29] (PROMO’s clustering panel is shown in Additional file 1: Figure S1). To explore the resulting clusters, the reordered matrix can be visualized in comparison to multiple sample labels (Fig. 4a).

Fig. 4
figure 4

Identification and characterization of cancer subtypes. Unsupervised analysis followed by enrichment analysis is performed on both samples and features for identifying clinically significant samples groups, and for biologically characterizing them based on the functions of co-expressed gene groups. a The RNA-Seq expression matrix of TCGA’s breast cancer cohort after clustering both samples (columns) and genes (rows) into four clusters using the K-means algorithm. Clustering is based on the top 2000 variable genes. White lines separate clusters in each dimension. The bars below the matrix show selected sample labels (here: the clustering and PAM50). Matrix and bars were created using PROMO’s multi-label matrix drawing. b Gene clusters were characterized using PROMO’s gene ontology enrichment tool. The figure shows the five most significant GO terms for every gene cluster. c-e Sample clusters were characterized using the sample clinical labels: c PROMO’s multi-label analysis tool automatically tests the clinical labels of different types (numeric, ordinal, categorical or survival) for enrichment on the sample clusters. FDR correction is performed over all clinical labels of the same type but separately for different types. The various d Sample clusters can also be characterized for a single label by showing its value distribution in each cluster and by calculating enrichment. e Survival functions for each cluster. The p-values are the significance of the separation of each cluster from the rest using the log-rank test

After the genes have been clustered, the built-in Gene Ontology tool can help interpret the biological meaning of gene clusters using enrichment analysis (Fig. 4b) [30]. Likewise, the clinical labels on the samples can be used to statistically characterize each sample cluster. A comprehensive analysis can be applied to each sample cluster using all clinical labels available for the cohort (numeric, ordinal, categorical, or survival labels). The result is a characterization of each cluster, together with FDR corrected p-values [31, 32] in a unified report (Fig. 4c). Enrichment tests for the sample clusters can also be performed using any selected single clinical label (Fig. 4d). Finally, survival analysis performed on the sample clusters can test their prognostic value using Kaplan-Meier plots [33] and log-rank (Mantel–Haenszel) test [34](Fig. 4e). Taken together, PROMO’s clustering and automatic multi-label enrichment analysis can quickly partition both samples and features into distinct groups and assess their biological meaning using the clinical labels.

Identification of distinguishing genes and features (biomarker discovery)

Having obtained patient subgroups of interest, either by sample clustering or using a predefined sample label, we may wish to identify distinguishing genes and features that differ significantly among sample groups. Such differentially expressed genes can shed light on the biological difference between sample clusters, and act as biomarkers for classifying a new sample to a sample class.

After selecting the label and the groups that will be compared, PROMO enables the application of various statistical tests for identifying genes that are differentially expressed among the groups. The p-values obtained by the tests can be used for gene sorting, filtering and for clustering the genes into up-regulated and down-regulated groups. PROMO’s Gene Ontology enrichment analysis can be executed on the resulting gene groups for characterizing the function of up-regulated and down-regulated genes. FDR correction and fold-change based filtering are also supported. PROMO’s biomarker discovery panel and an example of its output are shown in Additional file 1: Figure S2.

For detecting survival biomarkers, PROMO can rank all genes by their association to survival, based on Cox regression analysis [35]. In addition, the user can use the expression levels of selected genes to generate a new sample label (for example, HER2_Low and HER2_High). Kaplan-Meier plots can then be used to estimate the significance of survival differences between sample groups defined by the new label.

Lastly, PROMO can help in finding genes that are functionally related to a given gene of interest by ranking all genes based on their correlation to it. Altogether, the various techniques described here and implemented in PROMO can quickly identify genes that take part in the biological differences between sample groups and may serve as biomarkers for the selected label.

Automatic generation of a simple molecular classifier

After having partitioned the dataset samples, characterized the sample groups and their genes, and established the clinical relevance of the groups, PROMO can build an algorithm to classify a new sample into one of the groups. Such a classifier, especially if based on a small number of genes (rather than the thousands used to identify the subgroups) can serve as a significant step towards translating the analysis results into a diagnostic biomarker for clinical use.

Of the many possible classifier types, decision trees have the advantages of being easy to understand, highly interpretable biologically and easily visualized [36]. Furthermore, they allow for controlling the tradeoff between accuracy and simplicity. For predicting any selected sample label, PROMO can generate a simple decision tree with a single click (Fig. 5). The generated decision tree can be visualized graphically, specified textually, and saved to a Matlab file as a function. Automatic cross-validation and parameter optimization make it easy for the user to come up with a simple decision tree that may be in future subtype classification kits. It is also possible to generate a large number of random trees and rank the genes by the frequency of their appearance in the trees, thus identifying informative features for subtype classification.

Fig. 5
figure 5

Automatically generated decision tree for classifying breast tumors into the four PAM50 classes. PROMO can generate a cross-validated decision tree for any selected sample label using the currently loaded matrix as training data. In this figure, a four-gene molecular classifier for breast cancer subtypes is presented, showing a 7.77% loss on the training data, and a 15% averaged loss on 10-fold cross-validation

Integrative multi-omic analysis

In multi-omic datasets, each sample is characterized by several omic profiles (e.g., gene expression, methylation, copy number). Integrative analysis of multi-omic cancer datasets has the potential of revealing biological regulatory patterns that are missed in single omic analysis, and tools for performing such analyses are currently in great demand [37, 38].

PROMO provides several features for handling and analyzing multi-omic datasets. The profiles composing a multi-omic dataset can be imported from repositories into a ‘Multi-Omic Dataset Collection’ in PROMO (Fig. 2e). The user can navigate between the matrices, edit them independently, and select a subset of the datasets for downstream integrative analysis. Precompiled dataset collections for several TCGA cancer type cohorts are available on PROMO’s download page.

After setting up a multi-omic collection, the “inter-omic correlation identification” feature helps to detect correlations between features in two selected omics. This feature allows the identification of correlations between features from different biological levels. For instance, anti-correlation between mRNA expression and DNA methylation levels can pinpoint biological regulation.

The “Multi-omic clustering” feature can be used to cluster the dataset samples based on several omic matrices simultaneously. To this end, PROMO provides implementations of the multi-omic algorithms SNF [39], NEMO [40], and Consensus Clustering [41] modified for multi-omic data. Additional file 1: Figure S4 demonstrates the application of a multi-omic clustering algorithm on three different omics of the TCGA’s breast cancer cohort.

Discussion

Recent cancer projects such as TCGA [11], GDC [42], ICGC [43] as well as the GEO [18] database, provide the research community with a wealth of omic profiles and extensive clinical information on cancer patients [44]. Analysis of the data is challenging and requires advanced bioinformatics, statistical, and programming skills. A thorough analysis of these datasets - and larger ones expected in the future - by many researchers is crucial for improving cancer diagnosis and treatment.

PROMO aims to fill in a gap in available analysis tools for such large genomic and clinical cancer datasets. It is an interactive tool that is freely available and supports a rich collection of analysis methods and facilitates useful workflows for data exploration and visualization, cancer subtype identification, biomarker discovery and integrative multi-omic analysis. (See Table 2 for a list of the key features). PROMO’s support for large sample size in addition to features like survival analysis and interrogation of the clinical data on sample clusters make it especially suitable for analyzing modern cancer datasets. While many of PROMO’s features are also available in other tools (Table 3), PROMO is unique in its comprehensiveness, support for large sample dimension and the spectrum of tools it provides.

Table 2 PROMO’s key features
Table 3 Comparison of the main functions provided by PROMO and by other tools

Our vision for PROMO is that it will be used as a one-stop-shop for mining clinically important insights from genomic datasets, quickly and without any need for programming skills. It accelerates the analysis process and makes it more accessible for non-computational cancer researchers. Within a single short session, the user can import a cancer dataset of interest, preprocess it, cluster its samples and features, test the sample clusters for significance using survival analysis and enrichment tests on the clinical labels, test the feature clusters for GO enrichment, identify subtype distinguishing features (biomarkers) using various statistical tests and export the results using various reports and figures. The simple classification capabilities in PROMO can automatically produce a decision tree classifier for any selected label, and thus act as a basis for a subtype diagnosis.

We intend to continue developing PROMO by adding features and supporting the tool’s users. We hope that PROMO’s comprehensiveness and ease of use will help cancer researchers make the best use of the accumulating cancer datasets to fulfill the promises of precision medicine.

Conclusions

PROMO is a powerful, user-friendly, stand-alone, publicly available tool for exploration, analysis, and interpretation of genomic cancer data together with clinical information.

Methods

Implementation

PROMO is a standalone Windows application that can support huge datasets and has a fast fully interactive graphical user interface. PROMO was written in MATLAB, and it runs over the freely available Matlab runtime environment, taking advantage of its strong computational engine and editable graphical outputs. PROMO is freely available for download at http://acgt.cs.tau.ac.il/promo/.

PROMO’s main screen (Fig. 2a) includes several key graphic elements: A large heatmap representing the currently analyzed genomic matrix is located at the center of the screen (heatmap colors correspond to the matrix values as indicated by the color scale on the right). Beneath the heatmap, a color-bar displays the currently selected sample labels. The same sample label colors will consistently be used by PROMO in all displays. The user can scroll down the list of clinical labels and explore their distribution over the samples. The panel on the left provides access to common commands and parameters. A text log that documents the analysis steps appears at the bottom of the screen. Figures 2B-F show the various panels that can be directly opened from the tab menu on the left of the screen, providing quick access to PROMO’s most useful features.

Availability and requirements

Project name: PROMO (Profiler of Multi-Omics data)

Project home page: http://acgt.cs.tau.ac.il/promo/

Operating system: Windows

Programming language: Matlab

Other requirements: Installation of Matlab runtime library R2019a (9.6)

License: GNU GPL 3.0

Any restrictions to use by non-academics: None