# Sparse classification with paired covariates


## Abstract

This paper introduces the paired lasso: a generalisation of the lasso for paired covariate settings. Our aim is to predict a single response from two high-dimensional covariate sets. We assume a one-to-one correspondence between the covariate sets, with each covariate in one set forming a pair with a covariate in the other set. Paired covariates arise, for example, when two transformations of the same data are available. It is often unknown which of the two covariate sets leads to better predictions, or whether the two covariate sets complement each other. The paired lasso addresses this problem by weighting the covariates to improve the selection from the covariate sets and the covariate pairs. It thereby combines information from both covariate sets and accounts for the paired structure. We tested the paired lasso on more than 2000 classification problems with experimental genomics data, and found that for estimating sparse but predictive models, the paired lasso outperforms the standard and the adaptive lasso. The R package palasso is available from CRAN.

## Keywords

Prediction · Sparsity · Lasso regression · Paired data

## Mathematics Subject Classification

62-04 · 62J12 · 62J07 · 62H30 · 62P10

## 1 Background

Lasso regression has become a popular method for variable selection and prediction. Among other things, it extends generalised linear models to settings with more covariates than samples. The lasso shrinks the coefficients towards zero, setting some coefficients equal to zero. Compared to the standard lasso, the adaptive lasso shrinks large coefficients less. In high-dimensional spaces, most coefficients are set to zero, since the number of non-zero coefficients is bounded by the sample size (Zou and Hastie 2005). It is possible to decrease the maximum number of non-zero coefficients, and estimate the coefficients given this sparsity constraint. By including fewer covariates, the resulting model may be less predictive but more practical and interpretable. Given an efficient algorithm that produces the regularisation path, we can extract models of different sizes without increasing the computational cost.
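To make the difference between the two kinds of shrinkage concrete, the following minimal sketch applies them to single coefficients. It rests on assumptions that are not part of the text above: an orthonormal design, so that each coefficient can be shrunk separately by soft-thresholding, and adaptive weights equal to the absolute unpenalised estimate. The function names are illustrative only.

```python
def soft_threshold(b, threshold):
    """Shrink b towards zero by `threshold`; return zero if |b| <= threshold."""
    magnitude = max(abs(b) - threshold, 0.0)
    return magnitude if b >= 0 else -magnitude


def lasso_update(b, lam):
    """Standard lasso: every coefficient receives the same penalty lam."""
    return soft_threshold(b, lam)


def adaptive_lasso_update(b, lam):
    """Adaptive lasso (assumed weight |b|): the penalty lam / |b| is small for large b."""
    if b == 0:
        return 0.0
    return soft_threshold(b, lam / abs(b))


# Hypothetical unpenalised estimates: one large, one small.
estimates = [3.0, 0.5]
standard = [lasso_update(b, 1.0) for b in estimates]
adaptive = [adaptive_lasso_update(b, 1.0) for b in estimates]
```

In this toy setting both methods set the small coefficient to zero, while the large coefficient is shrunk by 1 under the standard lasso and by only 1/3 under the adaptive lasso.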

Paired covariates arise in many applications. Possible origins include two measurements of the same attributes, and two transformations of the same measurements. The covariates are then in two sets, with each covariate in one set forming a pair with a covariate in the other set. These covariate sets may be strongly correlated. Naively, we could either exclude one of the two sets or ignore the paired structure. However, we want to include both sets, and account for the paired structure. Such a compromise potentially improves predictions.

Our motivating example is to predict a binary response from microRNA isoform (isomiR) expression quantification data. MicroRNAs help to regulate gene expression and are dysregulated in cancer. Typically, most raw counts from such sequencing experiments equal zero. Different transformations of RNA sequencing data lead to different predictive abilities (Zwiener et al. 2014), and knowledge about the presence or absence of an isomiR might be more predictive than its actual expression level (Telonis et al. 2017). We hypothesise that combining two transformations of isomiR data, namely a count and a binary representation, improves predictions. We also analysed other molecular profiles to show the generality of our approach.

The paired lasso, like the group lasso (Yuan and Lin 2006) and the fused lasso (Tibshirani et al. 2005), is an extension of the lasso for a specific covariate structure. If the covariates are split into groups, we could use the group lasso to select groups of covariates. If the covariates have a meaningful order, we could use the fused lasso to estimate similar coefficients for close covariates. And if there are paired covariates, we recommend the paired lasso to weight among and within the covariate pairs.

Our aim is to create a sparse model for paired covariates. The paired lasso exploits not only both covariate sets but also the structure between them. We demonstrate that it outperforms the standard and the adaptive lasso in a number of settings, while also showing its limitations.

In the following, we introduce paired covariate settings and the paired lasso (Sect. 2), classify cancer types based on two transformations of the same molecular data (Sect. 3), discuss sparsity constraints and potential applications to other paired settings (Sect. 4), and predict survival from gene expression in tumour and normal tissue (see appendix).

## 2 Method

### 2.1 Setting

The setting comprises *n* samples, one response and twice *p* covariates. We allow for continuous, discrete, binary and survival responses. We assume all covariates are standardised, and the setting is high-dimensional (\({p \gg n}\)). Let the \({n \times 1}\) vector \({\varvec{y}}\) represent the response, the \({n \times p}\) matrix \({\varvec{X}}\) the first covariate set, and the \({n \times p}\) matrix \({\varvec{Z}}\) the second covariate set:

\[ {\varvec{y}} = (y_1,\ldots ,y_n)^T, \qquad {\varvec{X}} = (x_{ij})_{n \times p}, \qquad {\varvec{Z}} = (z_{ij})_{n \times p}. \]

The one-to-one correspondence between \({\varvec{X}}\) and \({\varvec{Z}}\) gives rise to paired covariates. In practice, the two covariate sets may represent different transformations of the same data. For each *j* in \(\{1,\ldots ,p\}\), the \({n \times 1}\) covariate vectors \({\varvec{x}}_j\) and \({\varvec{z}}_j\) represent one covariate pair.

The linear predictor for sample *i* in \(\{1,\ldots ,n\}\) equals

\[ \eta _i = \alpha + \sum _{j=1}^p \beta _j x_{ij} + \sum _{j=1}^p \gamma _j z_{ij}, \]

where \(\alpha \) is the unknown intercept, and \({\varvec{\beta }} = (\beta _1,\ldots ,\beta _p)^T\) and \({\varvec{\gamma }} = (\gamma _1,\ldots ,\gamma _p)^T\) are the unknown regression coefficients. We want to estimate a model with a limited number of non-zero coefficients (e.g. at most 10). Our ambition is to select the most predictive model given such a sparsity constraint. Although additional covariates could improve predictions, many applications require small model sizes.

### 2.2 Paired lasso

For the standard and the adaptive lasso, we have to decide whether the model should exploit the first covariate set \({\varvec{X}}\), the second covariate set \({\varvec{Z}}\), or both. If we included only one covariate set, we would lose the information in the other covariate set. If we included both covariate sets, we would double the dimensionality and still ignore the paired structure. In contrast, the paired lasso exploits both covariate sets, and accounts for the paired structure.

The paired lasso weights the covariates according to one of four schemes:

**(1)** within the first covariate set \({\varvec{X}}\),

**(2)** within the second covariate set \({\varvec{Z}}\),

**(3)** among all covariates, or

**(4)** among and within covariate pairs.

A tuning parameter determines the weighting scheme. Each of its four values leads to different weights for the two covariates of any pair *j*, computed from initial estimates of their effects (see below). Figure 1 illustrates the four weighting schemes, by showing the sets of weights emanating from some initial estimates. The first three schemes are fallbacks to the adaptive lasso based on \({\varvec{X}}\) (scheme 1), \({\varvec{Z}}\) (scheme 2), or both (scheme 3). The pairwise-adaptive scheme (scheme 4) is novel: it weights among and within covariate pairs. It depends on the data which weighting scheme leads to the most predictive model.
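As a rough illustration of the three fallback schemes only, the sketch below turns initial estimates into covariate-specific penalty factors. It assumes the usual adaptive lasso convention (weight equal to the absolute initial estimate, penalty factor equal to its inverse) and represents an excluded covariate by an infinite penalty; these conventions and the function name are assumptions made here. The pairwise-adaptive scheme is deliberately left out, because its weights are not spelled out in the surrounding text.

```python
import math


def fallback_penalties(initial_x, initial_z, scheme):
    """Penalty factors for both covariate sets under fallback schemes 1-3.

    Assumption: penalty factor = 1 / |initial estimate|; an infinite
    penalty factor means the covariate cannot enter the model.
    """
    if len(initial_x) != len(initial_z):
        raise ValueError("paired covariate sets need equally many covariates")

    def inverse(estimates):
        return [1.0 / abs(e) if e != 0 else math.inf for e in estimates]

    excluded = [math.inf] * len(initial_x)
    if scheme == 1:  # adaptive lasso within the first covariate set
        return inverse(initial_x), excluded
    if scheme == 2:  # adaptive lasso within the second covariate set
        return excluded, inverse(initial_z)
    if scheme == 3:  # adaptive lasso among all covariates
        return inverse(initial_x), inverse(initial_z)
    raise ValueError("only the fallback schemes 1, 2 and 3 are sketched")


# Hypothetical initial estimates for two covariate pairs.
penalties_x, penalties_z = fallback_penalties([0.5, -0.25], [0.25, 0.0], scheme=3)
```

With these hypothetical estimates, the covariate with the strongest initial estimate receives the smallest penalty, and the covariate with a zero estimate is excluded.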

Exploiting the efficient procedure for penalised maximum likelihood estimation from glmnet (Friedman et al. 2010), we use internal cross-validation to select \(\lambda \) from 100 candidates, and to select the weighting scheme from four candidates. To avoid overfitting, we estimate the weights in each internal cross-validation iteration. The scheme parameter governs the type of weighting, and the tuning parameter \(\lambda \) determines the amount of regularisation. Despite the covariate-specific penalty factors, the paired lasso is only four times as computationally expensive as the standard lasso, because it fits one regularisation path per weighting scheme. Unlike cross-validating the weighting scheme, cross-validating all individual covariate weights would be computationally infeasible and likely prone to overfitting.
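The joint selection of the weighting scheme and \(\lambda \) can be pictured as a search over a small grid of cross-validated losses. The sketch below assumes that the mean cross-validated loss has already been computed for every scheme and every \(\lambda \) candidate; the data structure, the function name and the numbers are hypothetical.

```python
def select_scheme_and_lambda(cv_loss):
    """Return (scheme, lambda index, loss) with the smallest cross-validated loss.

    cv_loss maps each weighting scheme to a list of mean cross-validated
    losses, one entry per lambda candidate.
    """
    best = None
    for scheme in sorted(cv_loss):
        for index, loss in enumerate(cv_loss[scheme]):
            if best is None or loss < best[2]:
                best = (scheme, index, loss)
    if best is None:
        raise ValueError("no cross-validated losses supplied")
    return best


# Hypothetical losses for four schemes and three lambda candidates each.
example_loss = {
    1: [1.30, 1.10, 1.20],
    2: [1.40, 1.25, 1.35],
    3: [1.30, 1.05, 1.15],
    4: [1.20, 0.95, 1.10],
}
choice = select_scheme_and_lambda(example_loss)
```

With four schemes and 100 candidates for \(\lambda \), the grid holds 400 entries, which matches the statement that the cost is about four times that of one regularisation path.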

### 2.3 Initial estimators

Inspired by the adaptive lasso (Zou 2006), we estimate the effects of the covariates on the response in two steps, obtaining the initial and the final estimates from the same data. Suggested initial estimates for the adaptive lasso in high-dimensional settings include absolute coefficients from ridge (Zou 2006), lasso (Bühlmann and van de Geer 2011) and simple (Huang et al. 2008) regression. Marginal estimates have several advantages over conditional estimates. First, estimating conditional effects is hard in high-dimensional settings with strongly correlated covariates. Conditional estimation strongly depends on the type of regularisation. Second, estimating marginal effects is computationally more efficient than estimating conditional effects. Third, we can easily improve the quality of the marginal estimates by empirical Bayes, because standard errors are available (Dey and Stephens 2018).

Although marginal and conditional effects of covariates may differ strongly, we conjecture covariates with strong marginal effects tend to be conditionally more important than those with weak marginal effects. Using the same hypothesis, Fan and Lv (2008) showed that reducing dimensionality by screening out covariates with weak marginal effects can improve model selection. For each combination of two covariates, we conjecture the one with the greater absolute correlation coefficient is conditionally more important than the other. Instead of comparing all coefficients at once, we compare them within the first covariate set, within the second covariate set, among all covariates, and simultaneously among and within the covariate pairs. These comparisons correspond to the four weighting schemes.
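A minimal sketch of marginal initial estimates, assuming the absolute Pearson correlation between each covariate and the response is used, as the comparison of absolute correlation coefficients above suggests. The helper names are illustrative, and the empirical Bayes improvement mentioned earlier is not included.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equally long sequences (0.0 if one is constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cross = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return cross / (norm_u * norm_v)


def marginal_estimates(columns, y):
    """Absolute correlation of every covariate (one column each) with the response."""
    return [abs(pearson(column, y)) for column in columns]


# Hypothetical data: four samples, three covariates.
response = [0, 0, 1, 1]
covariates = [[1, 2, 3, 4], [4, 3, 2, 1], [7, 7, 7, 7]]
initial = marginal_estimates(covariates, response)
```

Each estimate needs only one pass over one covariate, which is why marginal estimation scales easily to high-dimensional settings.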

## 3 Results

We tested the paired lasso in 2048 binary classification problems. In each classification problem, we used one molecular profile to classify samples into two cancer types. Our paired covariates consist of two representations of the same molecular profile. We compared the paired lasso with the standard and the adaptive lasso.

### 3.1 Classification problems

Molecular tumour markers may improve cancer diagnosis, cancer staging and cancer prognosis. One may analyse blood or urine samples to detect cancer, classify cancer subtypes, predict disease progression, or predict treatment response. Because too few liquid biopsy data are available for reliably evaluating prediction models, we analyse tissue samples to classify cancer types, as a proof of concept. This is less clinically relevant, but allows a comprehensive comparison of models. The challenge is to select a small subset of features with high predictive power.

The Cancer Genome Atlas (TCGA) provides genomic data for more than 11,000 patients. From the harmonised data, we retrieved gene expression quantification, microRNA isoform (isomiR) expression quantification, microRNA (miRNA) expression quantification, and “masked” copy number segments with TCGAbiolinks (Colaprico et al. 2016). Data are available for 19,602 protein-coding genes, 197,595 isomiRs, and 1881 miRNAs. The transcriptome profiling data are counts, and the copy number variation (CNV) data are segment mean values. We extracted the segment mean values at 10,000 evenly spaced chromosomal locations. The samples come from different types of material. We included primary solid tumour samples for all cancer types available, except in the case of leukaemia, where we included peripheral blood samples. For patients with replicate samples, we randomly chose one sample.

We used double cross-validation with 10 internal and 5 external folds to tune the parameters and to estimate the prediction accuracy, respectively. In the outer cross-validation loop, we repeatedly \((5\times )\) split the samples into four external folds for training and validation \((80\%)\), and one external fold for testing \((20\%)\). In the inner cross-validation loop, we repeatedly \((10\times )\) split the samples for training and validation into nine inner folds for training \((72\%)\) and one inner fold for validation \((8\%)\). Training samples serve for estimating the regression coefficients \({\varvec{\beta }}\) and \({\varvec{\gamma }}\), validation samples for tuning the regularisation parameter \(\lambda \) and the weighting scheme, and testing samples for measuring the predictive performance. As a loss function for logistic regression, we chose the deviance \(-2 \sum _{i=1}^n \{ y_i \log {(p_i)} + {(1-y_i)} {\log (1-p_i)} \}\), where \(y_i\) and \(p_i\) are the observed response and the predicted probability for individual *i*, respectively. Although we minimised the deviance to tune the parameters, we also calculated the area under the receiver operating characteristic curve (AUC) and the misclassification rate to estimate the prediction accuracy. Since indirect maximisation might lead to suboptimal AUCs (Cortes and Mohri 2004), we prefer the deviance as a primary evaluation metric.
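The deviance above translates directly into code. The following sketch evaluates it for observed binary responses and predicted probabilities; clipping the probabilities away from zero and one is an assumption added here for numerical safety, and the clipping constant is arbitrary.

```python
import math


def binomial_deviance(y, p, eps=1e-15):
    """Deviance -2 * sum(y*log(p) + (1-y)*log(1-p)) for binary responses.

    Assumption: probabilities are clipped to [eps, 1 - eps] so that the
    logarithm is always defined.
    """
    total = 0.0
    for y_i, p_i in zip(y, p):
        p_i = min(max(p_i, eps), 1.0 - eps)
        total += y_i * math.log(p_i) + (1 - y_i) * math.log(1.0 - p_i)
    return -2.0 * total


# Hypothetical predictions for two individuals.
uninformative = binomial_deviance([1, 0], [0.5, 0.5])
confident = binomial_deviance([1, 0], [0.9, 0.1])
```

Predicting 0.5 for every individual yields a deviance of \(2 \log 2\) per individual, which is a convenient reference when reading cross-validated deviances.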

### 3.2 Paired covariates

Transcriptome profiling data require some preprocessing. We preprocessed the expression counts for each cancer–cancer combination separately, using the same procedure for genes, isomiRs and miRNAs. The total raw count for an individual is its library size, and the total raw count for a transcript is its abundance. We used the trimmed mean normalisation method from edgeR (Robinson and Oshlack 2010) to adjust for different library sizes, and filtered out all transcripts with an abundance smaller than the sample size. This filtering removes non-expressed transcripts and lets the dimensionality increase with the sample size. Furthermore, we Anscombe-transformed the normalised expression counts (\(x \rightarrow {2\sqrt{x + 3/8}}\)).
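Two of these preprocessing steps are simple enough to sketch: the abundance filter and the Anscombe transformation. The sketch assumes a count matrix stored as one row per sample, and it omits the trimmed mean normalisation, which is left to edgeR. The function names are illustrative.

```python
import math


def abundant_transcripts(counts):
    """Indices of transcripts whose total raw count is at least the sample size.

    counts holds one row per sample and one column per transcript.
    """
    n = len(counts)
    p = len(counts[0])
    return [j for j in range(p) if sum(row[j] for row in counts) >= n]


def anscombe(x):
    """Anscombe transformation x -> 2 * sqrt(x + 3/8)."""
    return 2.0 * math.sqrt(x + 3.0 / 8.0)


# Hypothetical raw counts: three samples, three transcripts.
raw = [[0, 5, 1], [0, 0, 1], [1, 0, 1]]
kept = abundant_transcripts(raw)
transformed = [[anscombe(row[j]) for j in kept] for row in raw]
```

In the toy matrix the first transcript has abundance 1, which is below the sample size of 3, so only the second and third transcripts are kept.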

Gene expression: Shmulevich and Zhang (2002) binarise microarray gene expression data by separating low and high expression values with an edge detection algorithm. For each gene *j*, we sorted the normalised counts in ascending order \(x_{(1)j} \le \cdots \le x_{(n)j}\), and calculated the differences between consecutive values \(d_{ij} = x_{(i+1)j} - x_{(i)j}\). Maximising \({H(i/n)} d_{ij}\) with respect to *i*, where \(H(\cdot )\) is the binary entropy function, we obtained the cutoff for gene *j*. The binary covariate \(z_{ij}\) indicates whether the continuous covariate \(x_{ij}\) is above this cutoff.

IsomiR and miRNA expression: Telonis et al. (2017) binarise isomiR data by labelling the bottom \({80\%}\) and top \({20\%}\) most expressed isomiRs of a sample as “absent” or “present”, respectively. Because we analysed samples from only two cancer types at a time, and filtered out low-abundance transcripts, this binarisation procedure would be unstable. Instead, we let the binary covariate matrix \({\varvec{Z}}\) indicate non-zero expression counts.
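Both binarisation rules can be sketched in a few lines. For the entropy-weighted edge detection, the sketch places the cutoff at the midpoint of the selected gap; the text does not state where inside the gap the cutoff lies, so this placement is an assumption, as are the function names.

```python
import math


def binary_entropy(q):
    """Binary entropy H(q) in bits, with H(0) = H(1) = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1.0 - q) * math.log2(1.0 - q)


def entropy_weighted_cutoff(values):
    """Cutoff in the gap between consecutive sorted values maximising H(i/n) * d_i."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("at least two values are needed")
    best_score, best_i = -1.0, 1
    for i in range(1, n):  # i values lie below the candidate gap
        gap = ordered[i] - ordered[i - 1]
        score = binary_entropy(i / n) * gap
        if score > best_score:
            best_score, best_i = score, i
    # Assumption: the cutoff is the midpoint of the selected gap.
    return (ordered[best_i - 1] + ordered[best_i]) / 2.0


def above_cutoff(values):
    """Binary covariate: 1 if a value exceeds the entropy-weighted cutoff."""
    cutoff = entropy_weighted_cutoff(values)
    return [1 if v > cutoff else 0 for v in values]


def zero_indicator(counts):
    """Binary covariate: 1 for a non-zero expression count."""
    return [1 if c != 0 else 0 for c in counts]


# Hypothetical normalised counts of one gene in six samples.
gene = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
cutoff = entropy_weighted_cutoff(gene)
binary = above_cutoff(gene)
```

The entropy factor favours gaps that split the samples into balanced groups, so a large gap separating a single outlier from the rest is less likely to be chosen.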

Copy number variation: If *c* is a copy number, the corresponding segment mean value equals \({\log _2 (c/2)}\). Negative and positive values indicate deletions and amplifications, respectively. Without introducing lower and upper bounds, we only assigned values equalling zero to the diploid category. Accordingly, the ternary covariate matrix \({\varvec{Z}}\) indicates the signs of the segment mean values.
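A small worked example of the mapping from copy numbers to segment mean values and to the ternary representation; the function names are illustrative, and a positive copy number is assumed.

```python
import math


def segment_mean(copy_number):
    """Segment mean value log2(c / 2) for a positive copy number c."""
    return math.log2(copy_number / 2)


def ternary(value):
    """Sign of a segment mean value: -1 (deletion), 0 (diploid), 1 (amplification)."""
    return (value > 0) - (value < 0)


# Copy numbers 1, 2 and 4 give segment means -1, 0 and 1.
means = [segment_mean(c) for c in (1, 2, 4)]
signs = [ternary(m) for m in means]
```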

In each case, the molecular feature *j* is represented by both \({\varvec{x}}_j\) and \({\varvec{z}}_j\). Preparing for penalised regression, we transformed all covariates to mean zero and unit variance.

### 3.3 Predictive performance

Natural competitors for the paired lasso are the standard and the adaptive lasso. We compared the paired lasso, exploiting both covariate sets \({\varvec{X}}\) and \({\varvec{Z}}\), with six competing models: the standard and the adaptive lasso exploiting either \({\varvec{X}}\), \({\varvec{Z}}\), or both. We strive for very sparse models, as often desired in clinical practice. For now, each model may include up to 10 covariates.
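One way to picture such a sparsity constraint is as a filter on the regularisation path: only models with at most the allowed number of non-zero coefficients remain candidates for cross-validation. The sketch below assumes the path is available as a list of coefficient vectors, one per \(\lambda \) candidate; this representation and the function name are hypothetical and do not describe the palasso internals.

```python
def candidates_within_limit(path, max_nonzero):
    """Indices of the models on the path with at most max_nonzero non-zero coefficients."""
    return [
        index
        for index, coefficients in enumerate(path)
        if sum(1 for c in coefficients if c != 0) <= max_nonzero
    ]


# Hypothetical path with decreasing regularisation from left to right.
path = [
    [0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0],
    [0.7, 0.1, 0.0],
    [0.8, 0.2, -0.1],
]
allowed = candidates_within_limit(path, max_nonzero=2)
```

Because the path is computed once, tightening or loosening the limit only changes which of its models are eligible, not the fitting cost.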

The next step is to test whether the paired lasso is significantly better than the competing models. For each molecular profile and each competing model, we calculated the difference in deviance between the paired lasso and the competing model. A setting with *k* cancer types leads to \(\binom{k}{2}\) differences in deviance. However, these values are mutually dependent because of the overlapping cancer types. We therefore cannot directly test whether they are significantly different from zero. Instead, we accounted for their dependencies.

We combined the resulting *p* values with the Simes combination test (Westfall 2005). This combination leads to one *p* value for each molecular profile and each competing model (Table 1). At the \({5\%}\) level, 22 out of 24 combined *p* values are significant. The two insignificant improvements occur for gene expression and for CNV, each in comparison with one of the adaptive lasso models. We conclude that for these data the paired lasso is significantly better than the competing models.
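For reference, the Simes combination of *m* *p* values is the minimum of \(m \, p_{(i)} / i\) over the ordered *p* values \(p_{(1)} \le \cdots \le p_{(m)}\). The sketch below implements this formula; the input values are hypothetical and unrelated to Table 1.

```python
def simes(p_values):
    """Simes combined p value: min over i of m * p_(i) / i for ordered p values."""
    ordered = sorted(p_values)
    m = len(ordered)
    if m == 0:
        raise ValueError("at least one p value is needed")
    return min(m * p / rank for rank, p in enumerate(ordered, start=1))


# Hypothetical p values from three tests.
combined = simes([0.01, 0.04, 0.03])
```

Here the candidates are 0.03, 0.045 and 0.04, so the combined *p* value equals 0.03.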

Table 1 Combined *p* values

| | Standard | Standard | Standard | Adaptive | Adaptive | Adaptive |
|---|---|---|---|---|---|---|
| gene | 0.0003 | 0.0035 | 0.0034 | 0.0024 | 0.0242 | |
| isomiR | 0.0003 | 0.0011 | 0.0010 | 0.0021 | 0.0091 | 0.0147 |
| miRNA | 0.0003 | 0.0003 | 0.0003 | 0.0305 | 0.0010 | 0.0066 |
| CNV | 0.0003 | 0.0003 | 0.0003 | 0.0011 | 0.0096 | |

### 3.4 Weighting schemes

Table 2 Selected weighting schemes

| | Scheme (1) | Scheme (2) | Scheme (3) | Scheme (4) |
|---|---|---|---|---|
| gene | 0.21 | 0.33 | 0.32 | 0.14 |
| isomiR | 0.26 | 0.25 | 0.21 | 0.28 |
| miRNA | 0.36 | 0.10 | 0.26 | 0.29 |
| CNV | 0.31 | 0.15 | 0.17 | 0.37 |

Subject to at most 10 non-zero coefficients, the paired lasso has a better predictive performance than the standard and the adaptive lasso based on the first and/or the second covariate set. We repeated cross-validation with tighter and looser sparsity constraints. As the maximum number of non-zero coefficients increases, the differences between the paired lasso and the competing models decrease (Fig. 7). Alleviating the sparsity constraint allows the competing models to include more or all relevant predictors. This improves classifications, leaves less room for further improvements, and makes the pairwise-adaptive weighting less important. Nevertheless, without a sparsity constraint, the paired lasso leads to much sparser models than the standard lasso (Table 3).

Table 3 Average numbers of non-zero coefficients

| | Standard | Standard | Standard | Adaptive | Adaptive | Adaptive | Paired | Elastic |
|---|---|---|---|---|---|---|---|---|
| gene | 31 | 22 | 21 | 20 | 17 | 17 | 18 | |
| isomiR | 33 | 31 | 28 | 20 | 19 | 18 | 18 | |
| miRNA | 26 | 38 | 28 | 16 | 21 | 16 | 16 | |
| CNV | 83 | 110 | 105 | 51 | 78 | 63 | 61 | |

## 4 Discussion

We developed the paired lasso for estimating sparse models from paired covariates. It handles situations where it is unclear whether one covariate set is more predictive than the other covariate set, or whether both covariate sets together are more predictive than one covariate set alone.

Under a sparsity constraint, the paired lasso can have a better predictive performance than the standard and the adaptive lasso based on the first and/or the second covariate set. In our comparisons, the standard and the adaptive lasso each have three chances to beat the paired lasso: exploiting the first covariate set, the second covariate set, or both. Nevertheless, the paired lasso, automatically choosing from the two covariate sets, improves the best standard and the best adaptive lasso.

This improvement stems from introducing a pairwise-adaptive weighting scheme and choosing among multiple weighting schemes. A super learner (van der Laan et al. 2007) would combine predictions from multiple weighting schemes, improving predictions at the cost of interpretability. In contrast, the paired lasso attempts to select the most predictive combination of covariate sets, and the most predictive covariates.

Sparsity constraints should be employed regardless of whether the underlying effects are sparse or not. Their purpose is to make models as sparse as desired. Even if numerous covariates influence the response, we might still be interested in the top few most influential covariates. For example, a cost-efficient clinical implementation may require a limited number of markers. But if the standard lasso without a sparsity constraint returns a sufficiently sparse model, the sparsity constraint is redundant.

The paired lasso uses the response twice, first for weighting the covariates, and then for estimating their coefficients. This two-step procedure increases the weight of presumably important covariates, and decreases the weight of presumably unimportant covariates. Therefore, without an effective sparsity constraint, the paired lasso tends to sparser models than the standard lasso, and with an effective sparsity constraint, the paired lasso tends to more predictive models than the standard lasso.

Molecular profiles with *meaningful thresholds* also include exon expression and DNA methylation. Exons can have different types of effects on a clinical response. Some exons are retained for some samples, but spliced out for other samples. Other exons are retained for all samples, but with different expression levels. Both the change from “non-expressed” to “expressed” and the expression level might have an effect. We could match zero-indicators with count covariates to account for both types of effects. Similarly, beyond considering CpG islands as unmethylated or methylated, we could also account for methylation levels.

Some molecular profiles lead to *categorical variables* with three or more levels. Single nucleotide polymorphism (SNP) genotype data take the values zero, one and two minor alleles. Depending on the effect of interest, we would normally construct indicators for “one or two minor alleles” to analyse dominant effects, indicators for “two minor alleles” to analyse recessive effects, or quantitative variables to analyse additive effects. Instead, we could include both indicator groups to account for all three types of effects. Similarly, we could represent CNV data as two sets of ternary covariates, the first indicating losses and gains, and the second indicating great losses and great gains.

Another source of paired covariates is *repeated measures*. If the same molecular profile is measured twice under the same conditions, the average might be a good choice. But less so if the same molecular profile is measured under different conditions. Then it might be better to match the repeated measures. An interesting application is to predict survival from gene expression in tumour tissue and in normal tissue collected from the vicinity of the tumour (Huang et al. 2016). We compared the paired lasso with the standard and the adaptive lasso based on tumour and/or normal tissue (see appendix). For at least five out of six cancer types, the paired lasso fails to improve the cross-validated predictive performance. We argue that sparsity might be a wrong assumption for these data, in particular for the survival response, which may be better accommodated by dense predictors like ridge regression (van Wieringen et al. 2009). Indeed, the standard lasso generally selects few or no variables for four cancer types. Moreover, adaptation fails to improve the standard lasso for another cancer type, leaving little room for improvement to the paired lasso, which is essentially a bag of adaptive lasso models. Finally, for one cancer type, the paired lasso is competitive with the adaptive lasso based on tumour tissue, both performing relatively well. The paired lasso has the practical advantage of automatically selecting from the covariate sets.

An omnipresent challenge is the *integration* of multiple molecular profiles (Gade et al. 2011; Bergersen et al. 2011; Aben et al. 2016; Boulesteix et al. 2017; Rodríguez-Girondo et al. 2017). The paired lasso is not directly suitable for analysing multiple molecular profiles simultaneously. However, for two molecular profiles with a one-to-one correspondence, the paired lasso can be used as an integrative model. A well-known example is messenger RNA expression and matched DNA copy number.

Paired main and *interaction* effects have the same paired structure as paired covariates. Since the paired lasso would treat the two sets of effects as two sets of covariates, it would violate the hierarchy principle. In this context, the group lasso was shown to be beneficial (Ternès et al. 2017). Although the paired lasso might also improve predictions, an adaptation would be required to enforce the hierarchy principle.

[…] *and* covariate sets, because these are overlapping groupings.

We focussed on binary responses, but our approach also works with other univariate responses. Currently, our implementation supports linear, logistic, Poisson and Cox regression. Although it allows for \(L_1\) regularisation (lasso), \(L_2\) regularisation (ridge) and combinations thereof (elastic net), sparsity constraints require an \(L_1\) penalty, and the performance under an \(L_2\) penalty requires further research.

## Notes

### Acknowledgements

This research was funded by the Department of Epidemiology and Biostatistics, Amsterdam UMC, VU University Amsterdam.

### Author Contributions

The authors contributed to this research by developing the method (AR, MAW), preparing the manuscript (AR) or the appendix (ICT), and revising the manuscript critically (ICT, MAJ, RXM, MAW). All authors read and approved the final manuscript.

### Compliance with ethical standards

### Conflict of interest

The authors declare that they have no potential conflicts of interest.

### Reproducibility

The R package palasso contains a vignette for reproducing all results.

### Software

The R package palasso runs on any operating system equipped with R-3.5.0 or later. It is available from CRAN under a free software license: https://CRAN.R-project.org/package=palasso.


## References

- Aben N, Vis DJ, Michaut M, Wessels LF (2016) TANDEM: a two-stage approach to maximize interpretability of drug response models based on multiple molecular data types. Bioinformatics 32(17):i413–i420. https://doi.org/10.1093/bioinformatics/btw449
- Bergersen LC, Glad IK, Lyng H (2011) Weighted lasso with data integration. Stat Appl Genet Mol Biol 10(1):39. https://doi.org/10.2202/1544-6115.1703
- Boulesteix AL, De Bin R, Jiang X, Fuchs M (2017) IPF-LASSO: Integrative \(L_1\)-penalized regression with penalty factors for prediction based on multi-omics data. Comput Math Methods Med 2017:7691937. https://doi.org/10.1155/2017/7691937 (ipflasso)
- Bühlmann P, van de Geer S (2011) Statistics for high-dimensional data: methods, theory and applications. Springer, Berlin. https://doi.org/10.1007/978-3-642-20192-9
- Campbell F, Allen GI (2017) Within group variable selection through the exclusive lasso. Electron J Stat 11(2):4220–4257. https://doi.org/10.1214/17-EJS1317
- Colaprico A, Silva TC, Olsen C, Garofano L, Cava C, Garolini D, Sabedot TS, Malta TM, Pagnotta SM, Castiglioni I et al (2016) TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Res 44(8):e71. https://doi.org/10.1093/nar/gkv1507
- Cortes C, Mohri M (2004) AUC optimization vs. error rate minimization. In: Thrun S, Saul LK, Schölkopf B (eds) Advances in neural information processing systems 16. MIT Press, Cambridge, pp 313–320
- Dey KK, Stephens M (2018) CorShrink: empirical Bayes shrinkage estimation of correlations, with applications. bioRxiv https://doi.org/10.1101/368316
- Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B (Stat Methodol) 70(5):849–911. https://doi.org/10.1111/j.1467-9868.2008.00674.x
- Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw. https://doi.org/10.18637/jss.v033.i01 (glmnet)
- Gade S, Porzelius C, Fälth M, Brase JC, Wuttig D, Kuner R, Binder H, Sültmann H, Beißbarth T (2011) Graph based fusion of miRNA and mRNA expression data improves clinical outcome prediction in prostate cancer. BMC Bioinform 12(1):488. https://doi.org/10.1186/1471-2105-12-488
- Huang J, Ma S, Zhang CH (2008) Adaptive lasso for sparse high-dimensional regression models. Stat Sin 18(4):1603–1618
- Huang X, Stern DF, Zhao H (2016) Transcriptional profiles from paired normal samples offer complementary information on cancer patient survival-evidence from TCGA pan-cancer data. Sci Rep 6:20567. https://doi.org/10.1038/srep20567
- Reid S, Tibshirani R (2016) Sparse regression and marginal testing using cluster prototypes. Biostatistics 17(2):364–376. https://doi.org/10.1093/biostatistics/kxv049
- Robinson MD, Oshlack A (2010) A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 11(3):R25. https://doi.org/10.1186/gb-2010-11-3-r25 (edgeR)
- Rodríguez-Girondo M, Kakourou A, Salo P, Perola M, Mesker WE, Tollenaar RA, Houwing-Duistermaat J, Mertens BJ (2017) On the combination of omics data for prediction of binary outcomes. In: Datta S, Mertens BJ (eds) Statistical analysis of proteomics, metabolomics, and lipidomics data using mass spectrometry. Springer, Cham, pp 259–275. https://doi.org/10.1007/978-3-319-45809-0_14
- Shmulevich I, Zhang W (2002) Binary analysis and optimization-based normalization of gene expression data. Bioinformatics 18(4):555–565. https://doi.org/10.1093/bioinformatics/18.4.555
- Telonis AG, Magee R, Loher P, Chervoneva I, Londin E, Rigoutsos I (2017) Knowledge about the presence or absence of miRNA isoforms (isomiRs) can successfully discriminate amongst 32 TCGA cancer types. Nucleic Acids Res 45(6):2973–2985. https://doi.org/10.1093/nar/gkx082
- Ternès N, Rotolo F, Heinze G, Michiels S (2017) Identification of biomarker-by-treatment interactions in randomized clinical trials with survival outcomes and high-dimensional spaces. Biom J 59(4):685–701. https://doi.org/10.1002/bimj.201500234
- Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Methodol) 58(1):267–288
- Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K (2005) Sparsity and smoothness via the fused lasso. J R Stat Soc Ser B (Stat Methodol) 67(1):91–108. https://doi.org/10.1111/j.1467-9868.2005.00490.x
- van de Wiel MA, Lien TG, Verlaat W, van Wieringen WN, Wilting SM (2016) Better prediction by use of co-data: adaptive group-regularized ridge regression. Stat Med 35(3):368–381. https://doi.org/10.1002/sim.6732 (GRridge)
- van der Laan MJ, Polley EC, Hubbard AE (2007) Super learner. Stat Appl Genet Mol Biol 6(1):25. https://doi.org/10.2202/1544-6115.1309
- van Wieringen WN, Kun D, Hampel R, Boulesteix AL (2009) Survival prediction using gene expression data: a review and comparison. Comput Stat Data Anal 53(5):1590–1603. https://doi.org/10.1016/j.csda.2008.05.021
- Westfall PH (2005) Combining \(P\) values. In: Armitage P, Colton T (eds) Encyclopedia of biostatistics. Wiley, Hoboken. https://doi.org/10.1002/0470011815.b2a15181
- Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B (Stat Methodol) 68(1):49–67. https://doi.org/10.1111/j.1467-9868.2005.00532.x
- Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101(476):1418–1429. https://doi.org/10.1198/016214506000000735
- Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B (Stat Methodol) 67(2):301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x
- Zwiener I, Frisch B, Binder H (2014) Transforming RNA-Seq data to improve the performance of prognostic gene signatures. PLoS ONE 9(1):e85150. https://doi.org/10.1371/journal.pone.0085150

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.