1 Introduction

In medicine and biology, the availability and dimensionality of multi-omics data are increasing. Initiatives such as The Cancer Genome Atlas (TCGA) have collected and made available data from more than 20,000 patient cases across more than 30 cancer cohorts. For any specific disease, however, there are far fewer cases (n) than features (p). This is often referred to as a “big p, little n” problem, \(p \gg n\). Analyses are further complicated by the right-censored nature of time-to-event data, as death or recurrence may occur after the latest follow-up time for some cases.

Modern machine learning (ML) modeling techniques can be effective tools for knowledge and hypothesis generation in this \(p \gg n\) setting, but intelligent decisions in dimensionality reduction and feature engineering (FE) are crucial to their effectiveness (Shi et al 2019; Rendleman et al 2019; Kuhn and Johnson 2020, Chapter 10). To this end, the exploration, tuning, and comparison of FE approaches are necessary.

When evaluating FE techniques for modeling, best practice involves cross validation (CV) or another resampling technique to estimate generalization (extra-sample) performance. However, resampling limits exploration of novel and experimental FE techniques, as all preprocessing must be designed and parameterized ahead of time to allow systematic application to the training data in each iteration (Kuhn and Johnson 2020, Chapter 3.4.7).

For this reason, we select a holdout (HO) approach, which has lower computational requirements but yields higher variability in generalization performance estimates. To counteract this variability, we propose a continuous stratified sampling method that ensures the training and test sets are representative of the full outcome distribution.

Representative random sampling (RRS) is a sampling method applying constrained equipopulous bin stratification, where the observed distribution of the regression (or right-censored) outcome is retained in train/test splits. This is achieved by partitioning the data into equally sized bins based on the outcome variable and selecting HO sets (or CV folds) with stratified sampling on the partitions. The number of data points assigned to each bin is determined by the desired test/train split or the number of CV folds.

While bin stratification is not novel (Kuhn and Johnson 2020, Chapter 3), the current literature lacks empirical analysis of how it affects generalization performance estimation. In this paper, we characterize the statistical properties of samplings generated by RRS and employ Monte Carlo simulations to determine how this continuous stratification approach influences the error and bias of generalization performance estimates.

The focus of this article is on right-censored survival outcomes, but this sampling approach can be applied to any time-to-event or continuous regression prediction setting.

2 Representative random sampling (RRS) procedure

In standard CV or HO, group assignment is performed randomly with respect to the outcome variable. With representative random sampling (RRS), continuous stratified sampling ensures that each group is representative of the full outcome distribution: data points are partitioned into bins based on the continuous outcome variable, and stratified sampling is used to assign groupings for HO or CV.

The distinction between RRS and typical bin stratification lies in the definition of the bin. Instead of predefined ranges, the bins reflect the sorted order of data points. Bin boundaries are selected such that each bin contains an equal number of data points. The bin size parameter k is chosen to be as small as possible for the desired number of groupings. This minimum bin size is critical in preventing statistically significant relationships between the selected random grouping and the outcome of interest.

Algorithm 1

In Algorithm 1, k sequential data points are assigned to k separate groups in each iteration. When n is not a multiple of k, the remaining \(n \bmod k\) data points in X can be randomly assigned to unique folds with one additional iteration. This algorithm generates folds for k-fold CV. Groupings generated with this algorithm can also be used to assign data points to sets of differing size, as in a holdout procedure: for example, for a one-third holdout, set \(k=3\) and randomly assign one group to be the testing set. In this work, RRS-based variants of HO and CV are denoted RRHO and RRCV, respectively.
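
A minimal R sketch of this group assignment, assuming a numeric outcome vector `y` (the right-censored case is handled in the next paragraph); `rrs_folds` is a hypothetical name, not from the paper:

```r
# Sketch of Algorithm 1: equipopulous minimum-size bins over the sorted
# outcome, with one data point per bin assigned to each of k groups.
rrs_folds <- function(y, k) {
  n <- length(y)
  folds <- integer(n)
  ord <- order(y)                       # sort data points by outcome
  n_binned <- (n %/% k) * k             # points covered by complete bins
  for (i in seq(1, n_binned, by = k)) {
    bin <- ord[i:(i + k - 1)]           # k sequential points form one bin
    folds[bin] <- sample(1:k)           # one point from the bin per group
  }
  if (n_binned < n) {                   # remaining n mod k points go to
    rest <- ord[(n_binned + 1):n]       # distinct, randomly chosen groups
    folds[rest] <- sample(1:k, length(rest))
  }
  folds
}

# One-third holdout (RRHO): set k = 3 and hold out one random group.
set.seed(42)
y <- rexp(100)
groups <- rrs_folds(y, k = 3)
test_idx <- which(groups == sample(1:3, 1))
```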

For right-censored data, the precision of performance estimates depends on the relative fractions of observed and censored data points. Accordingly, observed events are distributed evenly across the sampled sets by assigning the censored and uncensored data points separately to k groups via RRS and combining the corresponding groups. This ensures that the resulting groupings contain censored and uncensored data in the same proportions as the full dataset and that any distributional differences due to non-random censoring are retained.
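
The censoring-aware variant can then be sketched as a thin wrapper, assuming `status` is coded 1 for observed events and 0 for censored cases:

```r
# RRS applied separately to the censored and observed subsets; the
# corresponding groups are then combined into a single assignment.
rrs_folds_surv <- function(time, status, k) {
  folds <- integer(length(time))
  for (s in unique(status)) {
    idx <- which(status == s)
    folds[idx] <- rrs_folds(time[idx], k)
  }
  folds
}
```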

3 Methods

3.1 Generalization performance estimation

Holdout (HO) and cross validation (CV) are two common ways to estimate the generalization performance of a predictive model using all available data. In HO, the model is trained on some portion of the data, while the rest is “held out” for testing after training. With a small number of data points, HO estimates of generalization performance can have high variance, as they depend strongly on the data points chosen for training and testing.

CV, on the other hand, systematically repeats HO such that every data point is used for testing exactly once. The data are partitioned into k folds, and the training procedure is conducted k times; in each iteration, one fold is held out as the testing data and the model is trained on the remaining \(k-1\) folds. Afterward, model performance measures from the k iterations are averaged.

As noted in Sect. 2, we refer to RRS-based variants of HO and CV as RRHO and RRCV, respectively.
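
As an illustration, an RRCV loop can be written over the `rrs_folds_surv` sketch from Sect. 2; `fit_fun` and `score_fun` are hypothetical stand-ins for model training and performance measurement:

```r
# k-fold CV with RRS fold assignment: train on k-1 folds, test on the
# held-out fold, and average the k performance measures.
cv_rrs <- function(data, time, status, k, fit_fun, score_fun) {
  folds <- rrs_folds_surv(time, status, k)
  scores <- sapply(1:k, function(i) {
    model <- fit_fun(data[folds != i, , drop = FALSE])
    score_fun(model, data[folds == i, , drop = FALSE])
  })
  mean(scores)
}
```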

Fig. 1 Flow chart depicting the procedure for a single step in the Monte Carlo simulations, including simulated data sampling, survival model training, and generalization performance calculation and estimation. Details on the ensemble modeling approaches and performance metrics are given in Sects. 3.2 and 3.5.3, respectively

3.2 Survival modeling approaches

Ensemble models combine the predictions of multiple models to produce more accurate, robust, and reliable predictions than single-learner models, at the cost of interpretability (Ardabili et al 2020). Four ensemble survival modeling approaches are applied in this work: gradient boosting with linear models (GLMBoost), gradient boosted regression trees (BlackBoost), random survival forests (RFSRC), and conditional inference random survival forests (CForest).

The primary distinctions between these approaches are the base learners applied and the method of ensemble construction. GLMBoost fits a series of linear survival models and BlackBoost fits multiple survival trees, but both use a gradient boosting approach, training each subsequent model to account for the residuals left by the preceding base learners (Hothorn et al 2022). In contrast, RFSRC and CForest both apply a variation of Breiman’s random forest (RF) modified to handle right-censored survival data. With RFSRC, the base learners are standard survival trees (Ishwaran and Kogalur 2022), whereas with CForest they are conditional inference survival trees (Hothorn et al 2006; Strobl et al 2007, 2008).

When training models in the Monte Carlo simulations, we used the implementations provided in the MachineShop R package using default hyperparameters (Smith 2021).
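
A hedged example of this setup, assuming a data frame `df` with `time`, `status`, and feature columns (the other three model types are fit analogously by swapping the `model` argument):

```r
library(MachineShop)
library(survival)

# Fit a boosted linear survival model with MachineShop defaults and
# generate predictions for new data.
fit_gb <- fit(Surv(time, status) ~ ., data = df, model = GLMBoostModel)
pred <- predict(fit_gb, newdata = df)
```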

3.3 TCGA-HNSC

Our analyses made significant use of TCGA’s head and neck squamous cell carcinoma (HNSCC) dataset, TCGA-HNSC (The Cancer Genome Atlas Network 2015). To investigate the statistical properties of RRS-based samplings, we used its right-censored survival times. To generate the simulated data for the Monte Carlo simulations, the RNA expression variables were used directly and the survival times indirectly (to tune the simulation parameters).

In the survival data, approximately two-thirds of cases are censored. The distributions of censored and observed survival times do not differ significantly, but they are still sampled separately with RRS. In the mRNAseq expression data, only the 520 patients with solid-tumor samples are included. Additionally, gene expression features with missing values or normalized variance \(\sigma^2 \le 0.005\) were excluded, leaving 16,628 expression features across the 520 patient cases.
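
The variance filter can be sketched as follows, assuming `expr` is a cases-by-genes matrix that has already been normalized:

```r
# Keep expression features with no missing values and normalized
# variance above the 0.005 threshold.
keep <- apply(expr, 2, function(g) !anyNA(g) && var(g) > 0.005)
expr_filtered <- expr[, keep]
```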

Fig. 2 Distribution of logrank p-values for repeated group assignment with RRS 1/3 HO, 1/3 HO using equipopulous bin stratification with non-minimum bin sizes, and a standard 1/3 HO (denoted “random”). A vertical line indicates the significance level \(p_\textrm{logrank}=0.05\)

3.4 Repeated samplings

To investigate the general properties of one-third HO sets selected using RRS, we employed the survival data present in TCGA-HNSC. HO group assignment was performed 10,000 times with several different types of sampling, including standard random sampling, RRS, and equipopulous bin stratification with non-minimum bin sizes.

For RRS-based one-third HO, we set \(k=3\). TCGA-HNSC has 528 cases with survival data, giving a maximum bin size of \(528/k = 176\). Logrank testing was performed on each group assignment, and the distributions of the resulting logrank p-values are reported per sampling approach.
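
One repetition of this procedure can be sketched with the survival package and the `rrs_folds_surv` function outlined in Sect. 2, assuming vectors `time` and `status` from TCGA-HNSC:

```r
library(survival)

# Draw one RRS 1/3 HO split and apply the logrank test to compare the
# survival distributions of the training and testing groups.
groups <- rrs_folds_surv(time, status, k = 3)
in_test <- groups == sample(1:3, 1)
lr <- survdiff(Surv(time, status) ~ in_test)
p_logrank <- pchisq(lr$chisq, df = 1, lower.tail = FALSE)
```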

3.5 Monte Carlo simulations

To examine the effects of RRS on generalization performance estimation, we adapted the methods of Borra and Di Ciaccio (2010), employing Monte Carlo simulations to compare 1/3 HO and 10-fold CV with and without RRS.

3.5.1 Simulated data

To provide a rich transcriptomics dataset, 3000 simulated data points are generated from the TCGA-HNSC transcriptomics data. A total of 1000 simulated survival outcomes are then generated, each assigning a survival time to every data point as a function of the simulated features. Five hundred of the outcomes use Weibull parameters tuned to match the survival times in TCGA-HNSC, while the remaining outcomes use randomly chosen parameters. Results for these “TCGA-like” and “General” outcomes are reported separately. Full details of simulated data generation are given in Appendix A.1.
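
For illustration only (the actual generative model is specified in Appendix A.1), Weibull survival times tied to the simulated features might be drawn as below, where `X`, `beta`, `shape0`, and `scale0` are assumed quantities:

```r
# Weibull survival times whose scale depends on a linear predictor of
# the simulated expression features; parameters are placeholders, not
# the paper's tuned values.
lp <- as.numeric(X %*% beta)
t_sim <- rweibull(nrow(X), shape = shape0, scale = scale0 * exp(lp))
```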

Fig. 3 Average error and bias values from Monte Carlo simulations when using C-index to evaluate survival models. Results are reported across multiple models and estimation approaches. Statistical comparisons were made between standard approaches (HO, CV) and their respective RRS-based variants; * indicates that, for that pair of methods, the difference in means was statistically significant (\(p<0.05\))

3.5.2 Simulated model performance

The following process is depicted in Fig. 1. For each simulated outcome, a sample of 500 of the 3000 data points is selected to be the “available” data. Several types of ML models are trained on the available data, and the remaining 2500 data points are used to calculate the models’ true generalization performance. Generalization performance estimation is then performed on the available data using four approaches: HO, RRHO, CV, and RRCV. With both true and estimated generalization performance in hand, the error and bias of the estimates can be calculated and compared (Borra and Di Ciaccio 2010).

For each simulated survival outcome, this process is repeated 30 times with different samples of 500 data points. After aggregating error and bias across all simulated outcomes and samplings (see Appendix A.2), the standard HO and CV approaches are compared to their RRS counterparts using Welch’s two-sample t-test.
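
A sketch of these estimate-quality measures and the final comparison, assuming vectors `est` and `true` of estimated and true performance across repetitions (the relative forms reported in the results, rRMSE and arb, follow Appendix A.2), with `err_standard` and `err_rrs` as hypothetical per-approach error vectors:

```r
bias  <- mean(est - true)            # average signed deviation
error <- sqrt(mean((est - true)^2))  # root mean squared deviation

# Welch's two-sample t-test (t.test defaults to var.equal = FALSE).
t.test(err_standard, err_rrs)
```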

3.5.3 Generalization performance metrics

To measure the predictive performance of survival models, three metrics are used: mean squared error (MSE), mean absolute error (MAE), and concordance index (C-index). MSE and MAE depend on the scale of the output, measuring the mean squared and mean absolute residuals, respectively. In contrast, the C-index measures how effective a model is at the relative ordering of outputs; intuitively, it is the probability that the predictions for any two comparable data points are correctly ordered. Notably, MSE and MAE are defined only for uncensored cases, rendering them less reliable when the number of right-censored data points is high. As described in Sect. 3.3, approximately two-thirds of the simulated and TCGA-HNSC data points are censored, making the C-index the most reliable of the three.
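
A naive C-index sketch for right-censored data, assuming `pred` is a risk score where a higher value implies shorter expected survival (production code would use an optimized implementation such as that in the survival package):

```r
# Harrell's C: fraction of comparable pairs whose predicted risks are
# ordered consistently with their observed survival times.
c_index <- function(time, status, pred) {
  conc <- 0; comp <- 0
  n <- length(time)
  for (i in 1:n) for (j in 1:n) {
    # a pair is comparable only if the earlier time is an observed event
    if (time[i] < time[j] && status[i] == 1) {
      comp <- comp + 1
      if (pred[i] > pred[j]) conc <- conc + 1          # correctly ordered
      else if (pred[i] == pred[j]) conc <- conc + 0.5  # tied prediction
    }
  }
  conc / comp
}
```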

4 Results and discussion

4.1 RRS samplings

The logrank test is typically used to compare time-to-event distributions between populations, where \(p_\textrm{logrank}<0.05\) indicates statistically significant differences. Figure 2 reports the distribution of \(p_\textrm{logrank}\) values from logrank tests applied to 10,000 repetitions of each sampling approach. It illustrates how equipopulous bin-stratified sampling restricts the set of possible random samplings to a subset that avoids statistically significant relationships between group assignment and time-to-event outcome.

The strength of this effect depends on the size of the bins used for stratified sampling. As bin size decreases, the \(p_\textrm{logrank}\) values associated with test/train groupings increase.

As expected for a fully random 1/3 HO, approximately \(5\%\) of the samplings yielded \(p_\textrm{logrank}<0.05\). With minimum-bin-size equipopulous bin stratification (RRS), no 1/3 HO samplings exhibited statistically significant differences. Even with the maximum bin size, the percentage of samplings with statistically significant differences was reduced to \(0.1\%\).

This shows that, among equipopulous bin-stratified samplings, RRS (i.e., minimum bin size) produces the greatest similarity in outcome distributions across HO sets or CV folds.

Table 1 Relative error (mean rRMSE) when using C-index to evaluate survival models
Table 2 Relative bias (mean arb \(\times 10^{-3}\), for brevity) when using C-index to evaluate survival models

4.2 Monte Carlo simulations

Figure 3 reports the error and bias for C-index estimates across all simulations and model types. The raw data are also provided in Tables 1 and 2 for the error and bias, respectively.

RRS-based estimation variants yielded statistically significant reductions in average relative bias in nearly all cases. Across HO estimation approaches, the boosted models exhibited the least bias. Applying RRS to CV estimation reduced the bias of the RF-based models to be on par with the least biased model type, GLMBoost.

For average error, RF-type models saw modest but significant decreases with CV. Error reduction with HO approaches was less consistent across the two sets of simulated survival outcomes, but statistical significance was found with both RFSRC and GLMBoost. Overall, GLMBoost gave the most accurate model performance estimates.

When estimating MAE and MSE, statistically significant reductions in error and bias tended to occur with the RF-based models. Additionally, the RF-type models were more reliable on average than the boosting-based models for these metrics. As previously noted, these metrics are less reliable than the C-index in this setting, as they are defined only for uncensored data points.

5 Conclusions

RRS is a continuous stratified sampling technique that minimizes statistically significant relationships between random group assignments and a time-to-event or continuous outcome. As a bin stratification approach, it is unique in applying equipopulous bins of minimum size.

RRS has valuable applications in train/test sampling for model performance evaluation. Simulations show that, compared with standard HO and CV approaches, it yields statistically significant reductions in average error and bias across several model types and performance metrics. Notably, we did not observe average increases in these metrics with the application of RRS.

This enables more reliable HO-based model performance estimation, in turn improving the credibility of hypotheses and knowledge generated by model analyses. In “big p, little n” data settings, RRS will allow data-driven FE approaches to be explored via ML modeling with higher confidence. RRS can also improve the reliability of k-fold CV performance estimates in contexts where repeated CV is infeasible or k is small.

The logical next steps of this work will involve the application of RRHO to FE problems in cancer genomics. This will facilitate the evaluation of FE methods via HO performance estimates and highlight approaches that are effective at extracting meaningful signal in this \(p \gg n\) setting. Additional comparisons among RRCV, the number of CV folds, and repeated CV could be fruitful, as they may distinguish other suitable contexts for the application of RRS.