Introduction

Biomarker identification is an important goal of cancer research for clinicians and biologists. How to explore specific biomarkers that can distinguish tumoral from normal tissues, identify treatment-resistant patients, predict patient prognosis and recurrence, etc., are routine research tasks. Recently, immunotherapies represented by immune checkpoint inhibitors have opened a new era in cancer treatment, significantly improving the clinical outcomes of cancer patients [1]. However, only a small fraction of patients can generate considerable benefits from immunotherapies [2]. Exploring specific biomarkers that can effectively predict immunotherapeutic efficacy is crucial for preventing under- or over-treatment.

With the advancement of bioinformatics techniques, researchers are inclined to explore cancer biomarkers using RNA-seq or microarray data [3, 4], and data mining has become an essential part of cancer research. However, these works may be difficult and inconvenient for clinicians and biologists without computational programming skills. Currently, several open-access web servers that allow users to analyze and visualize gene expression online directly are emerging, such as GEPIA [5], Xena [6], ExpressionAtlas [7], and HPA [8]. Although these web applications are valuable and broadly utilized, obtaining high confidence results in a specific tumor is difficult because their data sources are mainly derived from the TCGA database. Consistent performance across multiple independent datasets is the foundation for an excellent biomarker. In addition, the deeper exploration of specific biomarkers on underlying mechanisms, tumor microenvironment, and drug indications are missing in these tools.

To address these unmet needs, we have developed Biomarker Exploration for Solid Tumors (BEST), a web-based application for comprehensive biomarker exploration on large-scale data in solid tumors and delivering fast and customizable functionalities to complement existing tools.

Methods

Data collection

BEST is committed to identifying robust tumor biomarkers through large-scale data. Hence, we retrieved cancer datasets with both expression data and important clinical information (e.g., survival, therapy, etc.) as much as possible. Eligible datasets were mainly enrolled from five databases, including The Cancer Genome Atlas Program (TCGA, https://portal.gdc.cancer.gov), Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/), International Cancer Genome Consortium (ICGC, https://dcc.icgc.org), Chinese Glioma Genome Atlas (CGGA, http://www.cgga.org.cn/), and ArrayExpress (https://www.ebi.ac.uk/arrayexpress/). In total, we included more than 50,000 samples from 64 datasets for 27 cancer types.

Data re-annotation and pre-processing

Raw expression data were extracted for subsequent processing (Fig. 1). Data were re-annotated if the original probe sequences were available based on the GRCh38 patch 13 sequences reference from GENCODE (https://www.gencodegenes.org/). For RNA-seq data, raw count read was converted to transcripts per kilobase million (TPM) and further log-2 transformed. The raw microarray data from Affymetrix®, Illumina®, and Agilent® were processed using the affy [9], lumi [10], and limma [11] packages, respectively. The normalized matrix files were directly downloaded for microarray data from other platforms. Gene expression was further transformed into z-score across patients in each dataset. To make it easier for users to interpret and present analysis results, we cleaned and unified the clinical traits. Take KRAS mutation as an example, GSE39084 [12] named it ‘kras.gene.mutation.status’, ‘mutation’ was labeled ‘M’ and ‘wild type’ was labeled ‘WT’; whereas GSE143985 [13] named it ‘kras_mutation’, ‘mutation’ was labeled ‘Y’ and ‘wild type’ was labeled ‘N’. We uniformly termed it ‘KRAS’, and ‘mutation’ was labeled ‘Mut’ and ‘wild type’ was labeled ‘WT’.

Fig. 1
figure 1

Overview of the BEST analytical framework

Data calculation and storage

A tremendous amount of calculations are involved in BEST analysis, we thus have completed the time-consuming calculations in advance and used R.data for storage. Users can directly call these data, significantly reducing the user’s waiting time and background computing pressure. Take colorectal cancer (CRC) as an example, we collected a total of 47 datasets. Drug assessment is an analysis module of BEST, which requires fitting ridge regression models for individual drugs based on drug responses and expression data of cancer cell lines from the Genomics of Drug Sensitivity in Cancer_v1 (GDSC_v1), Genomics of Drug Sensitivity in Cancer_v2 (GDSC_v2), The Cancer Therapeutics Response Portal (CTRP), and Profiling Relative Inhibition Simultaneously in Mixtures (PRISM) databases, and then predicting the sensitivity of each drug for CRC samples from all collected datasets. Apparently, if these results are not calculated in advance, users may have to wait more than 3 days. The pre-calculated content is displayed in Fig. 1.

Implementations

BEST is entirely free for users, built by the Shiny app and the HTML5, CSS, and JavaScript libraries for the client-side user interface. The Shiny app (version: 1.7.2) mainly executes data processing and analysis. The function of BEST is divided into eight tabs (Fig. 1): Clinical association, Survival analysis, Enrichment analysis, Cell infiltration, Immunomodulator, Immunotherapy, Candidate agents, and Genomic alteration. Analysis results include images and tables, images can be downloaded in portable document format (PDF) and portable network graphics (PNG) format, and tables can be obtained in comma-separated value (CSV) format.

Results

Quick start

BEST offers a simple interactive interface. Users first select one cancer type and then determine the input category—single gene or gene list (Fig. 1). For the single-gene module, users can enter a gene symbol or an Ensembl ID in the ‘Enter gene name’ field to explore a gene of interest. The gene list module needs users to input a list of genes and pick a method to calculate the gene set score for each sample. The embedded methods include gene set variation analysis (GSVA) [14], single sample gene set enrichment analysis (ssGSEA) [15], z-score [16], pathway-level analysis of gene expression (PLAGE) [17], and the mean value. Users can customize the name of the gene set score.

Clinical association

In this module, users can explore the associations between the expression or score of the input variable and general characteristics (e.g., age, gender, alcohol, smoke, etc.), histological characteristics (e.g., tissue type, tumor site, stage, etc.), molecular characteristics (e.g., TP53 mutation, microsatellite instability, etc.) and treatment responses (e.g., chemotherapy and bevacizumab responses, etc.) (Fig. 2A). Whether to use parametric or nonparametric statistical tests for group comparisons based on the distribution of input variable [18]. For example, users can easily explore the differential expression of the input variable between tumor and normal tissues or find the associations between the input variable with smoke and alcohol. Our datasets also include abundant treatment responses, which might contribute to developing promising biomarkers in clinical settings. Importantly, analysis results tend to be displayed in multiple independent cohorts, which provides a reference for the stability power of a variable of interest. For instance, Fig. 2B illustrates that CRC tumors process a significantly higher expression of COL1A2 than normal tissues in most CRC datasets with tissue type information.

Survival analysis

BEST performs survival analysis based on gene expression or gene set score. This module allows users to explore the prognostic significance for overall survival (OS), disease-free survival (DFS), relapse-free survival (RFS), progression-free survival (PFS), and disease-specific survival (DSS) (Fig. 2C). BEST generates Kaplan–Meier curves with log-rank test and forest plot with cox proportional hazard ratio and the 95% confidence interval information for various survival outcomes in multiple independent datasets (Fig. 2C, D). Kaplan–Meier analysis requires categorical variables, we thus provide five cutoff options for users to choose from, including ‘median’, ‘mean’, ‘quantile’, ‘optimal’, and ‘custom’. For example, when investigating gene COL1A2 in survival analysis of CRC, users can obtain Kaplan–Meier curves with a specific cutoff approach and a Cox forest plot for five survival outcomes across all CRC datasets with survival information.

Fig. 2
figure 2

Modules for clinical association and survival analysis. A Four categories of clinicopathologic information are mainly included in the clinical association module. B An example illustrates the differential expression of COL1A2 in multiple CRC datasets between the tumor and normal groups. C Five categories of survival variables are utilized in the survival analysis module, and examples of five survival variables for Kaplan–Meier analysis. D An example displays the cox regression analysis of five survival variables in multiple CRC datasets

Enrichment analysis

BEST provides two enrichment frameworks: over-representation analysis (ORA) [19] and gene set enrichment analysis (GSEA) [20]. Users can select the top gene (self-defined number) most associated with the input variable to perform ORA and apply a ranked gene list based on the correlation between all genes and the input variable to carry out GSEA (Fig. 3A). Of note, the final correlation coefficient between the input variable and each gene is the average correlation of all datasets in specific cancer. The Pearson correlation was calculated between all genes and the input variable. If users input a gene list, which will be firstly calculated by one of the four provided algorithms, including gene set variation analysis (GSVA), single sample gene set enrichment analysis (ssGSEA), z-score, pathway-level analysis of gene expression (PLAGE), and the mean value. The output forms of ORA are GO and KEGG bar charts (Fig. 3B, C). The ‘Detected Genes’ are all top gene most related to the input variable, which are also existed in GO or KEGG gene sets. The ‘Enriched Genes’ are the top gene within the specific biological pathway. Also, GSEA results are exhibited using GSEA-Plot (Fig. 3D) and Ridge-Plot (Fig. 3E) images. The GO, KEGG, and Hallmark gene sets for GSEA are obtained from Molecular Signatures Database (MSigDB). Similarly, users could select single gene or gene list as input variable. The specific biological term of GO, KEGG, and Hallmark gene set could be shown as GSEA-Plot, or a series of biological terms could be displayed as Ridge-Plot.

Fig. 3
figure 3

Enrichment analysis module. A Enrichment analysis module includes two enrichment frameworks: over-representation analysis (ORA) and gene set enrichment analysis (GSEA). B, C GO (B) and KEGG (C) bar charts for the ORA framework. D, E GSEA-Plot (D) and Ridge-Plot (E) examples for the GSEA framework

Cell infiltration and immunomodulator

BEST offers eight prevalent algorithms to estimate immune cell infiltration in the tumor microenvironment (TME) (Fig. 4A), including CIBERSORT [21], CIBERSORT ABS [21], EPIC [22], ESTIMATE [23], MCP-counter [24], Quantiseq [25], TIMER [26], and xCell [27]. To avoid time-consuming calculations for users and save computing resources, these eight algorithms have been executed in advance across all datasets, and the resulting data have been stored in the website background. Additionally, BEST provides five immunomodulator categories: antigen presentation, immunoinhibitors, immunostimulators, chemokines, and receptors (Fig. 4A). Users can generate heatmap and correlation scatter plots from these two analysis modules. The heatmaps illustrate the correlations of the input variable with each immune cell/immunomodulator across all cohorts (Fig. 4B, C), and the correlation scatters plots indicate the correlation of the input variable and an immune cell/immunomodulator in a specific dataset (Fig. 4D, E).

Fig. 4
figure 4

Modules for cell infiltration and immunomodulator analysis. A BEST offers eight immune infiltration assessment algorithms and five categories of immunomodulators. B, C Heatmaps illustrate the correlations of the input variable with each immune cell (B) or immunomodulator (C) across all CRC datasets. D, E Correlation scatters plots indicate the correlation of the input variable and an immune cell (D) or immunomodulator (E) in a specific dataset

Immunotherapy

To further investigate the clinical significance of the input variable in immunotherapies, we retrieved 19 immunotherapeutic cohorts with expression data and immunotherapy information (e.g., CAR-T, anti-PD-1, anti-CTLA4, etc.) (Fig. 5A). Based on gene expression or gene set score in these datasets, users can conduct differential expression analysis (DEA) between response and non-response groups (Fig. 5B), receiver operating characteristic (ROC) curve to evaluate the performance of the input variable in predicting the immunotherapeutic efficacy (Fig. 5C), and survival analysis to assess the impact of the input variable on survival (OS and PFS) in immunotherapeutic cohorts that have undergone immunotherapies (Fig. 5D).

Fig. 5
figure 5

Immunotherapy analysis module. A Schema describing data details and analysis for the immunotherapy module. B Boxplots indicate the differential expression of COL12A between response and non-response groups. C Receiver operating characteristic (ROC) curves evaluate the performance of the input variable in predicting the immunotherapeutic efficacy. D Kaplan–Meier curves assess the impact of the input variable on survival (OS and PFS) in immunotherapeutic cohorts that have undergone immunotherapies

Candidate agents

In this analysis tab, BEST performs drug assessment in bulk samples based on drug responses and expression data of cancer cell lines from the GDSC_v1, GDSC_v2, CTRP, and PRISM databases (Fig. 6A). Given the inherent differences between bulk samples and cell line cultures, we introduced a correlation of correlations framework [28] to retain genes presenting analogical co-expression patterns in bulk samples and cell lines. As previously reported [29], the model used for predicting drug response was the ridge regression algorithm implemented in the oncoPredict package [30]. This predictive model was trained on transcriptional expression profiles and drug response data of cancer cell lines with a satisfied predictive accuracy were evaluated by default 10-fold cross-validation, thus allowing the estimation of clinical drug response using only the expression data of bulk samples (Fig. 6A). Modeling and prediction works have been completed, and drug assessments of all tumor samples based on four databases have been stored in the website background. BEST will calculate the correlations between all drugs and the input variable in all cohorts. According to the correlation rank of each drug across all datasets, we applied the robust rank aggregation (RRA) [31] to determine drug importance related to the input variable (Fig. 6B). Users can select the top drugs (self-defined number) to display the heatmap that illustrates the correlations of the input variable with each drug across all cohorts. Higher-ranked drugs indicate that high levels of the input variable predict drug resistance and vice versa. For example, high expression of COL1A2 might suggest Afatinib resistance and Dasatinib sensitivity based on the GDSC_v2 database (Fig. 6C). Also, users can select a drug database, a tumor dataset, and a specific drug to generate a correlation scatter plot (Fig. 6D).

Fig. 6
figure 6

Candidate agents module. A Overview of BEST performing drug assessment. B Details of obtaining drugs that are significantly associated with the input variable. C Heatmaps illustrate the correlations of the input variable with each drug across all cohorts. Higher-ranked drugs indicate that high levels of the input variable predict drug resistance and vice versa. D Correlation scatters plots indicate the correlation of the input variable and a drug in a drug database and a tumor dataset

Genomic alteration

In this module, BEST has pre-processed mutation and copy number variation data from the TCGA database using maftools [32] and GISTIC2.0 [33], respectively. Users can obtain a heatmap indicating genomic alterations as the input variable increase. The right panel of heatmap also displays the proportion of genomic alteration and statistical differences between the high and low groups. For example, with the rise in COL1A2 expression, the genomics landscape of the TCGA-CRC dataset is illustrated in Fig. 7. We found that the loss of chromosome segment 1p13.2 was more frequent in the high expression group.

Fig. 7
figure 7

Genomic alteration analysis module

Discussion

As an interactive web tool, BEST aims to explore the clinical significance and biological functions of cancer biomarkers through large-scale data. Therefore, data richness is the foundation of BEST. From data collection, re-annotation, pre-processing, and pre-calculation to storage, we provide a tidy and uniform pan-cancer database, allowing users to call and interpret data quickly. BEST offers prevalent analysis modules to enable researchers without computational programming skills to conduct various bioinformatics analyses. Compared with other available tools [5,6,7,8, 34,35,36], BEST has more datasets and more diverse analysis options, which complements well with them (Table 1).

Table 1 Comparison of BEST with other tools

In BEST web application, users can identify cancer biomarkers associated with critical clinical traits (e.g., stage and grade), prognosis, and immunotherapy. Moreover, the underlying mechanisms of these biomarkers could be further explored using the enrichment, cell infiltration, and immunomodulator analysis modules. Users can also apply the candidate agent analysis tab to investigate high levels of cancer biomarkers that might indicate which drugs are resistant and which are sensitive to specific cancer.

Taken together, BEST provides a curated database and innovative analytical pipelines to explore cancer biomarkers at high resolution. It is an easy-to-use and time-saving web tool that allows users, especially clinicians and biologists without background knowledge of bioinformatics data mining, to comprehensively and systematically explore the clinical significance and biological function of cancer biomarkers. With constant user feedback and further improvement, BEST is promising to serve as an integral part of routine data analyses for researchers.