1 Introduction

The projection predictive variable selection (PPVS; Piironen et al. 2020; Catalina et al. 2022) is a special predictive model selection method (Vehtari and Ojanen 2012) for Bayesian regression models that comes with valid post-selection inference (disregarding the selection of the final model size) and has been shown to perform better—in general—than alternative methods (Piironen and Vehtari 2017a). It is based on the Bayesian decision-theoretical variable selection framework by Lindley (1968) and the practical draw-by-draw Kullback–Leibler (KL) projection proposed by Goutis and Robert (1998) and Dupuis and Robert (2003). So far, the implementation of the PPVS in the R (R Core Team 2023) package projpred (Piironen et al. 2023) has been restricted to the Gaussian, the binomial, and the Poisson response families. Recently, the latent projection (Catalina et al. 2021) has extended the range of possible response families considerably, in particular to ordinal families relying on a single latent predictor (per observation), an example being the cumulative ordinal family from MASS::polr() (Venables and Ripley 2002). However, the latent projection is an approximate approach as it replaces the original projection problem with a latent projection problem. Here (Sect. 2), we present the exact solution to the original projection problem for discrete finite-support response families and call the corresponding procedure the augmented-data projection.

For investigating the performance of the augmented-data projection (Sect. 3), we confine ourselves to a simulation study comparing the augmented-data projection to the latent projection because the generally superior performance of the PPVS based on the traditional projection and based on the latent projection has already been demonstrated by Piironen and Vehtari (2017a) and Catalina et al. (2021), respectively.

We illustrate the application of the augmented-data projection in Sect. 4 with the help of a real-world example, thereby demonstrating another benefit of the augmented-data projection: its support for discrete finite-support response families with more than one latent predictor per observation (response families which are not supported by the latent projection).

Finally, our work is discussed in Sect. 5, where we also mention possible modifications of the augmented-data projection to extend it to more response families in the future.

2 Augmented-data projection

2.1 Notation

For the following mathematical presentation of the augmented-data projection (a special case of the general approach that is presented first), we assume the availability of a dataset with \(N\) observations. The observed response vector will be denoted by \(\varvec{y} = (y_1, \dotsc , y_N)^{\scriptscriptstyle \textsf{T}} \in \mathcal {Y}^{N} \subseteq \mathbb {R}^{N}\). We do not introduce any notation for the corresponding predictor data as we will always be conditioning implicitly on it. By \(\varvec{\tilde{y}} = (\tilde{y}_1, \dotsc , \tilde{y}_N)^{\scriptscriptstyle \textsf{T}}\), we will denote unobserved response values at the same observed predictor values, with realizations in \(\mathcal {Y}^{N}\).

A crucial factor (Piironen et al. 2020; Pavone et al. 2022) behind the superior performance of the PPVS is the reference model, which is the best possible model (in terms of predictive performance) one can construct. For projpred, the reference model is usually fitted within rstanarm (Goodrich et al. 2023) or brms (Bürkner 2017, 2018), which both rely on Stan (Carpenter et al. 2017; Stan Development Team 2022b), a probabilistic programming language and software that is mainly used for its dynamic Hamiltonian Monte Carlo (HMC) algorithm, a modern Markov chain Monte Carlo (MCMC) sampler. However, the methodology behind projpred is more general and does not require the reference model to be fitted within rstanarm or brms. Thus, we start by assuming that we have \(S^{*}\) draws \(\varvec{\theta }^{*}_{s} \in \varvec{\Theta }^{*}\) (\(s \in \{1, \dotsc , S^{*}\}\)) from the reference model’s posterior distribution, with \(\varvec{\Theta }^{*}\) denoting the reference model’s parameter space. Furthermore, for speeding up the computations, the PPVS assumes that these \(S^{*}\) posterior draws have been clustered or thinned so that \(\{1, \dotsc , S^{*}\} \supseteq \mathop {\bigcup }_{c = 1}^{C} \mathcal {I}^{*}_{c}\) with disjoint (and non-empty) index sets \(\mathcal {I}^{*}_{c}\). Important special cases are the single-cluster (\(C = 1\)) and the draw-by-draw (\(C = S^{*}\)) projection. If clustering (not thinning) is chosen, it is the intermediate case of \(C \in \{2, \dots , S^{*} - 1\}\) which requires an actual clustering of the posterior draws (an explanation of how this clustering is performed in projpred is given below). Based on the clustering (or thinning), we can restrict the reference model’s full posterior predictive distribution (for observation \(i \in \{1, \dots , N\}\)) to the \(c\)-th cluster of posterior draws (or the \(c\)-th thinned draw; \(c \in \{1, \dots , C\}\)):

$$\begin{aligned} p(\tilde{y}_i|\mathcal {I}^{*}_{c}) = \frac{1}{|\mathcal {I}^{*}_{c}|} \sum _{s \in \mathcal {I}^{*}_{c}} p(\tilde{y}_i|\varvec{\theta }^{*}_{s}). \end{aligned}$$

In doing so, the conditioning on an index set slightly abuses notation, but we think it improves readability while at the same time reflecting the basic idea behind this empirical average. Note that by conditioning on \(\varvec{\theta }^{*}_{s}\), \(p(\tilde{y}_i|\varvec{\theta }^{*}_{s})\) implicitly conditions on the observed data (\(\varvec{y}\)) and hence \(p(\tilde{y}_i|\mathcal {I}^{*}_{c})\) conditions on \(\varvec{y}\) as well. Expectations with respect to \(p(\tilde{y}_i|\mathcal {I}^{*}_{c})\) will be denoted by \(\mathbb {E}(\cdot |\mathcal {I}^{*}_{c})\).
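For a discrete finite-support response family (the focus of Sect. 2.3), this empirical average is a simple column mean over the draw-wise predictive probabilities. A minimal sketch in R, using our notation rather than projpred's internals (the objects draw_probs and I_c are our assumptions):

```r
## draw_probs: S* x J matrix for a fixed observation i, with row s holding
##             p(ytilde_i = j | theta*_s) for the J possible response values.
## I_c:        integer vector containing the draw indices of cluster c.
p_c <- colMeans(draw_probs[I_c, , drop = FALSE])  # p(ytilde_i = . | I*_c)
```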

A model selection problem comes with several candidate models, of which we will consider only a single one here, to avoid cluttering notation. In the context of a variable selection problem, this candidate model may also be called a submodel of the full model which includes all predictors. The parameter space of this representative submodel will be denoted by \(\varvec{\Theta }\) and its parameter-conditional predictive distribution (i.e., its likelihood when regarded as a function of the parameters) by \(p(\tilde{y}_i|\varvec{\theta })\) (for \(\varvec{\theta } \in \varvec{\Theta }\)). We emphasize that in general, \(\varvec{\Theta }\) does not have to be related to \(\varvec{\Theta }^{*}\) in any form (in particular, it does not have to be a restricted subspace).

Finally, we need the Kullback–Leibler (KL) divergence (Kullback and Leibler 1951) from a distribution \(p(x)\) to a distribution \(q(x)\):

$$\begin{aligned} D_{\textrm{KL}}\!\left( p(x) \,\Vert \, q(x) \right) = \mathbb {E}_{p(x)}\!\left( \log \frac{p(x)}{q(x)}\right) \end{aligned}$$

where we have added the subscript \(p(x)\) to clarify the distribution that the expectation refers to.

For the clustering (and several other steps), projpred requires an invertible link function \(g\). With this link function \(g\), projpred performs the clustering of the \(S^{*}\) posterior draws by applying stats::kmeans() (the stats package is part of R) to the \(S^{*}\) length-\(N\) vectors \(\varvec{g}(\mathbb {E}(\varvec{\tilde{y}}|\varvec{\theta }^{*}_{s}))\) where \(\varvec{g}\) denotes the vectorized link function, i.e., the function which applies \(g\) to each element of a vector.
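As a hedged sketch of this clustering step (the matrix mu_star, the vectorized link g_vec, and the number of clusters are our assumptions, not projpred's actual internals):

```r
## mu_star: S* x N matrix with row s holding E(ytilde | theta*_s);
## g_vec:   the vectorized link function, e.g., qlogis for the logit link.
g_vec <- qlogis
C <- 20  # hypothetical number of clusters
km <- stats::kmeans(g_vec(mu_star), centers = C)
## The index sets I*_c are then recovered from the cluster assignments:
I_c_list <- split(seq_len(nrow(mu_star)), km$cluster)
```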

2.2 General approach

In general, the submodel’s projected parameter values for cluster (or thinned draw) \(c \in \{1, \dotsc , C\}\) are obtained by solving

$$\begin{aligned} \varvec{\theta }_{c} = \;&\mathop {\textrm{argmin}}\limits _{\varvec{\theta } \in \varvec{\Theta }} \frac{1}{N} \sum _{i = 1}^{N} D_{\textrm{KL}}\!\left( p(\tilde{y}_i|\mathcal {I}^{*}_{c}) \,\Vert \, p(\tilde{y}_i|\varvec{\theta }) \right) \nonumber \\ = \;&\mathop {\textrm{argmax}}\limits _{\varvec{\theta } \in \varvec{\Theta }} \sum _{i = 1}^{N} \mathbb {E}\!\left( \log p(\tilde{y}_i|\varvec{\theta }) \,|\, \mathcal {I}^{*}_{c}\right) , \end{aligned}$$
(1)

see Piironen et al. (2020).

This projection problem is not easy to solve in general because \(\mathbb {E}(\cdot |\mathcal {I}^{*}_{c})\) is an expectation with respect to \(p(\tilde{y}_i|\mathcal {I}^{*}_{c})\). Equation (1) simplifies considerably if the submodel’s response family follows the definition from McCullagh and Nelder (1989, equation (2.4)) because in that case, \(\log p(\tilde{y}_i|\varvec{\theta })\) is linear in \(\tilde{y}_i\), at least for optimization with respect to the non-dispersion parameters. Another simplifying case is \(|\mathcal {Y}| < \infty\), which is the case of interest here (see Sect. 2.3).
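For completeness, a brief check of the McCullagh–Nelder case: writing the density from their equation (2.4) with natural parameter \(\vartheta _i\) (a function of \(\varvec{\theta }\)) and dispersion parameter \(\phi\), we have

$$\begin{aligned} \log p(\tilde{y}_i|\varvec{\theta }) = \frac{\tilde{y}_i \vartheta _i - b(\vartheta _i)}{a(\phi )} + c(\tilde{y}_i, \phi ), \end{aligned}$$

so that \(\mathbb {E}\!\left( \log p(\tilde{y}_i|\varvec{\theta }) \,|\, \mathcal {I}^{*}_{c}\right) = \frac{\mathbb {E}(\tilde{y}_i|\mathcal {I}^{*}_{c}) \, \vartheta _i - b(\vartheta _i)}{a(\phi )} + \mathbb {E}\!\left( c(\tilde{y}_i, \phi ) \,|\, \mathcal {I}^{*}_{c}\right)\). The last term does not depend on the non-dispersion parameters, so for those parameters, the objective in Eq. (1) only requires the cluster-wise means \(\mathbb {E}(\tilde{y}_i|\mathcal {I}^{*}_{c})\).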

2.3 Discrete finite-support response families

In case of \(|\mathcal {Y}| < \infty\), Eq. (1) simplifies because \(\mathbb {E}(\cdot |\mathcal {I}^{*}_{c})\) is then a sum over all possible response values:

$$\begin{aligned} \varvec{\theta }_{c} = \;&\mathop {\textrm{argmax}}\limits _{\varvec{\theta } \in \varvec{\Theta }} \sum _{i = 1}^{N} \sum _{\tilde{y} \in \mathcal {Y}} a^{*}_{c, i, \tilde{y}} \log p(\tilde{y}_i = \tilde{y}|\varvec{\theta }) \end{aligned}$$
(2)

with \(a^{*}_{c, i, \tilde{y}} = p(\tilde{y}_i = \tilde{y}|\mathcal {I}^{*}_{c})\). Equation (2) is simply a weighted maximum-likelihood (ML) problem when using an augmented dataset where each observation is repeated \(|\mathcal {Y}|\) times and the response value is set to each possible value \(\tilde{y} \in \mathcal {Y}\) in turn, so that the resulting augmented dataset has a total of \(N \cdot |\mathcal {Y}|\) rows. This approach is what we call the augmented-data projection, although in the projpred implementation, the augmented dataset is constructed internally as \(|\mathcal {Y}|\) blocks of \(N\) rows instead of the other way round.

Equation (2) shows that the augmented-data projection consists of fitting to the fit of the reference model, a fundamental property already exhibited by the traditional projection (Piironen et al. 2020). In case of a discrete response family with finite support, the fit of the reference model just needs to be expressed differently, namely in terms of probabilities for all of the response categories, and fitting to that fit then needs to be done in a weighted fashion.

Due to the augmented-data projection being a weighted ML problem, the basic idea for implementing it in projpred is simply to apply existing R functions capable of performing a weighted ML estimation (e.g., MASS::polr() in case of the commonly used cumulative ordinal family which is described in detail in Sect. 3.1) to the augmented dataset. Currently, projpred’s augmented-data projection adds support for the brms::cumulative() family (encoding the cumulative ordinal family), for rstanarm::stan_polr() fits (employing the cumulative ordinal family), and for the brms::categorical() family (encoding the nominal family used in multinomial logistic regression). We emphasize that these families refer to the submodels, not to the reference model. Typically, the reference model has the same response family as the submodels, but in general, the reference model is allowed to have a different family. In case of the augmented-data projection, the only requirement concerning the form of the reference model is that its response family must be discrete and have finite support (otherwise, the step from Eq. (1) to Eq. (2) would be incorrect). In theory, Eq. (2) does not require the submodel to have a discrete finite-support response family, but typically—and especially with respect to the implementation in projpred—this requirement makes sense.
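To make the weighted ML idea concrete, here is a hedged sketch for the cumulative ordinal case; the objects ref_probs and dat as well as the predictor names are our assumptions, not projpred's internals (which additionally loop over the clusters \(c\)):

```r
## ref_probs: N x J matrix with ref_probs[i, j] = p(ytilde_i = j | I*_c)
##            for one cluster c of the reference model's posterior draws.
## dat:       data frame with the submodel's predictors (columns x1, x2).
library(MASS)
J <- ncol(ref_probs)

## Augmented dataset with J blocks of N rows: in block j, the response is
## set to category j and weighted by the reference model's probability.
aug <- do.call(rbind, lapply(seq_len(J), function(j) {
  data.frame(dat,
             y_aug = factor(rep(j, nrow(dat)), levels = seq_len(J),
                            ordered = TRUE),
             w_aug = ref_probs[, j])
}))

## The weighted ML fit solves the projection problem (2) for cluster c:
prj_c <- polr(y_aug ~ x1 + x2, data = aug, weights = w_aug,
              method = "probit")
```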

The augmented-data projection has been added in version 2.4.0 of projpred (Piironen et al. 2023). In that version, an updated implementation of the latent projection (compared to Catalina et al. 2021) has been included as well. Note that for applying both—the augmented-data projection and the updated implementation of the latent projection—to reference model fits from brms, version 2.19.0 (or later) of brms is needed.

3 Simulation study

For the following simulation study comparing augmented-data and latent projection, we assume that the reader is familiar with the typical projpred workflow, as presented in the main vignette of the projpred package, for example.

3.1 Setup

Since the brms::categorical() family (see Sect. 2.3) comes with more than one latent predictor per observation (if \(|\mathcal {Y}| > 2\), which is the relevant case), it is not supported by the latent projection. Hence, our simulation study is restricted to the brms::cumulative() family (which encodes the same observation model as in rstanarm::stan_polr() fits, see Sect. 2.3).

More specifically, to comply with Catalina et al. (2021), we use \(J = |\mathcal {Y}| = 5\) response categories and the probit link function \(g = \Phi ^{-1}\) (the quantile function of the standard normal distribution). The number of observations is set to \(N = 100\), in accordance with the value used throughout the main article of Catalina et al. (2021).

Then, for each of \(R = 100\) simulation iterations, the simulation study involves the following steps:

  1.

    Define the \(J - 1\) latent thresholds (intercepts) \(\zeta _{j}\) (\(j \in \{1, \dotsc , J - 1\}\)) as

    $$\begin{aligned} \zeta _{j} = g\!\left( \frac{j}{J}\right) . \end{aligned}$$
  2.

    Generate \(P = 50\) regression coefficients \(\beta _{p}\) (\(p \in \{1, \dotsc , P\}\)) according to a regularized horseshoe prior (Piironen and Vehtari 2017c). The underlying mechanism may be found in the R code for this simulation study (see the link at the end of this section). Here, we choose a global scale parameter of

    $$\begin{aligned} \tau _{0} = \frac{p_{0}}{P - p_{0}} \cdot \frac{\tilde{\sigma }}{\sqrt{N}} \end{aligned}$$

    with \(p_{0} = 10\) and

    $$\begin{aligned} \tilde{\sigma }^2 = \exp \!\left( \frac{1}{J} \sum _{j = 1}^{J} \log \tilde{\sigma }_{j}^2\right) = \root J \of {\prod _{j = 1}^{J} \tilde{\sigma }_{j}^2}, \end{aligned}$$

    where \(\tilde{\sigma }_{j}^2\) is calculated according to Section 3.5 of Piironen and Vehtari (2017c), taking the same thresholds \(\zeta _{j}\) as defined above and assuming a typical data point with a latent predictor of zero so that all response categories are equally likely (in analogy to the approach of Piironen and Vehtari (2017b), in case of the binomial family with the logit link). Here, we obtain an overall pseudo variance of \(\tilde{\sigma }^2 \approx 1.06^2\). For the Student-\(t\) slab of the regularized horseshoe prior, we choose \(100\) degrees of freedom (effectively yielding a Gaussian slab) and a scale parameter of \(1\).

  3.

    Generate a training dataset according to the following data-generating model, where \(i \in \{1, \dots , N\}\) (an R sketch of steps 1 and 3 is given after this list):

    $$\begin{aligned}&x_{i, p} \sim \mathcal {N}(0, 1) \quad (p \in \{1, \dots , P\}),\\&\eta _i = \sum _{p = 1}^{P} \beta _{p} x_{i, p},\\&\varvec{\zeta } = (\zeta _{1}, \dotsc , \zeta _{J - 1})^{\scriptscriptstyle \textsf{T}},\\&y_i \sim \textrm{Cumul}(\varvec{\zeta }, \eta _i), \end{aligned}$$

    where \(\mathcal {N}(\mu , \sigma )\) denotes a normal distribution with mean \(\mu\) and standard deviation \(\sigma\) and \(\textrm{Cumul}(\varvec{\zeta }, \eta _i)\) denotes the distribution with probability mass function

    $$\begin{aligned} p(y_i = j|\varvec{\zeta }, \eta _i) = g^{-1}(\zeta _{j} - \eta _i) - g^{-1}(\zeta _{j - 1} - \eta _i) \end{aligned}$$

    for \(j \in \{1, \dots , J\} = \mathcal {Y}\), exploiting auxiliary elements \(\zeta _{0} = -\infty\) and \(\zeta _{J} = \infty\) and defining \(g^{-1}(-\infty ) = 0\) and \(g^{-1}(\infty ) = 1\) (as well as \(\pm \infty - b = \pm \infty\) for \(b \in \mathbb {R}\)).

  4.

    Generate an independent test dataset using the same data-generating model and the same settings (in particular, the same number \(N\) of observations) as for the training data.

  5.

    Fit a reference model to the training data, using the data-generating model as the data-fitting model, except that the prior for the thresholds \(\zeta _{j}\) (\(j \in \{1, \dotsc , J - 1\}\)) is set to \(\mathcal {N}(0, 2.5)\) in the data-fitting model. The reference model fit is performed by brms::brm(), using the cmdstanr (Gabry and Češnovar 2022) backend. We use the default of \(4\) Markov chains, each running \(1000\) warmup and \(1000\) post-warmup iterations. In order to avoid spurious divergences of Stan’s dynamic HMC sampler, we aim at smaller step sizes by setting adapt_delta = 0.99. By specifying init = 1, we narrow down the range that the initial parameter values are randomly drawn from (this was necessary to prevent chains from occasionally initializing in an area of the parameter space with log posterior density numerically equal to \(-\infty\), or from running into such an area shortly after initialization). We checked the convergence of the Markov chains for an initial reference model fit (based on a dataset independent of those from the \(R = 100\) simulation iterations) with the help of common MCMC diagnostics (Betancourt 2018; Vehtari et al. 2021; Stan Development Team 2022a; Bürkner et al. 2023).

  6.

    Run the following projpred steps twice (once with the augmented-data projection and once with the latent projection, but based on the same training and test data and based on the same reference model fit):

    (a)

      Run projpred::varsel(), which consists of two main steps: a heuristic search and the calculation of precursor quantities required for estimating predictive performance statistics along the predictor ranking returned by the search (see the next enumeration point, 6(b)). In doing so, we pass the test data via argument d_test. As search method, we choose the forward search because projpred’s augmented-data projection currently does not support the L1 search and also because the L1 search is often less accurate. Apart from that, we leave all other arguments at their defaults.

    (b)

      For each submodel size along the predictor ranking: retrieve the mean log predictive density (MLPD; actually mean log predictive probability, but the same acronym is used for simplicity), \(\Delta \textrm{MLPD} = \textrm{MLPD} - \textrm{MLPD}^{*}\) (with \(\textrm{MLPD}^{*}\) denoting the reference model MLPD), and the corresponding standard errors (SEs). This is achieved via projpred:::summary.vsel() applied to the output of projpred::varsel() from the previous enumeration point (6(a)), once with deltas = FALSE (for the MLPD) and once with deltas = TRUE (for \(\Delta \textrm{MLPD}\)). Here, the MLPD is the chosen performance statistic because of the desirable properties of the log score in general (Vehtari and Ojanen 2012) and because \(\exp (\textrm{MLPD})\), the geometric mean predictive density (GMPD), has an interpretable scale of \((0, 1)\) in case of a discrete response family. We denote the MLPD based on the augmented-data projection by \(\textrm{MLPD}_{\textrm{aug}}\) and the corresponding \(\Delta \textrm{MLPD}\) value by \(\Delta \textrm{MLPD}_{\textrm{aug}}\). For the latent projection, these are denoted by \(\textrm{MLPD}_{\textrm{lat}}\) and \(\Delta \textrm{MLPD}_{\textrm{lat}}\), respectively.

    (c)

      Suggest a submodel size via projpred::suggest_size(). As underlying performance statistic, we choose the MLPD again, for consistency with the results retrieved from projpred:::summary.vsel(). We denote the suggested size based on the augmented-data projection by \(G_{\textrm{aug}}\) and the suggested size based on the latent projection by \(G_{\textrm{lat}}\). Since we use the default arguments of projpred::suggest_size() (apart from the predictive performance statistic which is set to "mlpd"), \(G_{\textrm{aug}}\) is given by the smallest submodel size for which \(\Delta \textrm{MLPD}_{\textrm{aug}} + \textrm{SE}(\Delta \textrm{MLPD}_{\textrm{aug}}) \ge 0\) (and \(G_{\textrm{lat}}\) is defined analogously). If there is no submodel size fulfilling this criterion, NA is returned.
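As referenced in step 3, the following hedged R sketch illustrates steps 1 and 3 (threshold definition and training data generation); the coefficient vector beta from step 2 is assumed to exist:

```r
N <- 100; P <- 50; J <- 5
g_inv <- pnorm                   # inverse probit link
zeta <- qnorm((1:(J - 1)) / J)   # thresholds zeta_j = g(j / J)

x <- matrix(rnorm(N * P), nrow = N)  # x_{i,p} ~ N(0, 1)
eta <- as.vector(x %*% beta)         # latent predictors eta_i

## Cumulative probit probabilities:
## p(y_i = j) = g^{-1}(zeta_j - eta_i) - g^{-1}(zeta_{j-1} - eta_i)
cum <- cbind(0, sapply(zeta, function(z) g_inv(z - eta)), 1)
probs <- cum[, -1] - cum[, -(J + 1)]
y <- apply(probs, 1, function(p) sample.int(J, size = 1, prob = p))
```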

The R code for this simulation study is available on GitHub. Figures were created with ggplot2 (Wickham 2016).

3.2 Results

A central part of the projpred workflow is the plot of the chosen performance statistic (relative to the reference model’s performance) in dependence of the submodel size. This is essentially what Fig. 1 shows, slightly adapted to a simulation study: the lines from all \(R = 100\) simulation iterations are combined into one plot for the augmented-data and the latent projection, respectively. To avoid an overly crowded plot, the uncertainty bars that are otherwise part of this plot have been omitted.

Fig. 1 Relative predictive performance at increasing submodel sizes for the augmented-data projection (top) and the latent projection (bottom) in all \(R = 100\) simulation iterations. Here, “relative” means that the left y-axis shows \(\Delta \textrm{MLPD} = \textrm{MLPD} - \textrm{MLPD}^{*}\) (with \(\textrm{MLPD}^{*}\) denoting the reference model MLPD). The right y-axis is simply the \(\exp (\cdot )\) scale, i.e., it shows \(\textrm{GMPD} / \textrm{GMPD}^{*}\) (with \(\textrm{GMPD}^{*}\) denoting the reference model GMPD). Each line represents one simulation iteration. The median \(\textrm{MLPD}^{*}\) across all \(R = 100\) simulation iterations is about \(-1.4\) (minimum: \(-1.7\), first quartile: \(-1.6\), third quartile: \(-1.1\), maximum: \(-0.6\)). The median \(\textrm{GMPD}^{*}\) across all \(R = 100\) simulation iterations is about 0.24 (minimum: 0.19, first quartile: 0.20, third quartile: 0.34, maximum: 0.54)

A reassuring conclusion from Fig. 1 is that for both projection methods, an increasing submodel size eventually causes the predictive performance of the submodels to approach that of the reference model, although there are simulation iterations where a certain discrepancy from the reference model performance persists even at large submodel sizes. Nevertheless, we can conclude that both projection methods pass a basic check for being implemented correctly.

Figure 1 also shows that in some simulation iterations, the augmented-data projection’s MLPDs at small to moderate submodel sizes are closer to the reference model MLPD than those from the latent projection. This is even more evident from Fig. 2 where \(\Delta \textrm{MLPD}_{\textrm{lat}} - \Delta \textrm{MLPD}_{\textrm{aug}} = \textrm{MLPD}_{\textrm{lat}} - \textrm{MLPD}_{\textrm{aug}}\) is illustrated. Figure 2 also reveals that there are a few simulation iterations where the latent projection leads to a better predictive performance at large submodel sizes. These simulation iterations are investigated in more detail in Appendix A (Online Resource 2).

An inspection of the MLPD (or rather GMPD) values on absolute scale in Appendix B (Online Resource 2) reveals that in extreme cases, the discrepancy in predictive performance between augmented-data and latent projection is indeed non-negligible.

Fig. 2 Predictive performance based on the latent projection minus predictive performance based on the augmented-data projection, for increasing submodel sizes and all \(R = 100\) simulation iterations (represented by lines). The right y-axis is simply the \(\exp (\cdot )\) scale of the left y-axis. Note that for a given simulation iteration, the predictor ranking can differ between the augmented-data and the latent projection

The lack of uncertainty bars in Figs. 1 and 2 obscures the fact that all underlying predictive performance values are only estimates. Thus, it is important to inspect, for example, the corresponding standard errors (SEs). Figure 3 does this by depicting the differences \(\textrm{SE}(\Delta \textrm{MLPD}_{\textrm{lat}}) - \textrm{SE}(\Delta \textrm{MLPD}_{\textrm{aug}})\). The mostly positive differences in Fig. 3 show that the latent projection is associated with greater uncertainty than the augmented-data projection. Analogously to the peaks at large submodel sizes from Fig. 2, there are latent-projection SEs at large submodel sizes which are noticeably smaller than their counterparts based on the augmented-data projection. As a side note, Appendix A (Online Resource 2) reveals that the SEs from one of the simulation iterations investigated there belong to this rare case.

Fig. 3 Standard error (SE) in relative predictive performance based on the latent projection minus the same SE based on the augmented-data projection, for increasing submodel sizes and all \(R = 100\) simulation iterations (represented by points and summarized by boxplots)

In the typical projpred workflow, the plot of the chosen performance statistic in dependence of the submodel size is mainly used in the decision for a submodel size for the final projection. Ideally, this plot-based decision is made manually by incorporating subject-matter knowledge, application-specific trade-offs, and the absolute scale of the predictive performance statistic. In a real-world application, the heuristic offered by projpred::suggest_size() should only be interpreted as a suggestion, but for the purpose of a simulation study, such a heuristic is helpful. Figure 4 illustrates the frequency (across the simulation iterations) of all encountered differences \(G_{\textrm{lat}} - G_{\textrm{aug}}\) of the sizes \(G_{\textrm{lat}}\) and \(G_{\textrm{aug}}\) suggested by this heuristic. The high peak of the distribution at zero shows that the augmented-data and the latent projection often result in the same suggestion for the submodel size. Moreover, the slight right-skewness of the distribution (i.e., the presence of a few large positive differences) indicates that there are some simulation iterations where the latent projection leads to a clearly larger suggested size than the augmented-data projection. This slower convergence of the submodel MLPDs towards the reference model MLPD in case of the latent projection was already visible more directly in Figs. 1 and 2. It is also reflected (indirectly) by the larger frequency of \(\texttt {NA}_{\textrm{lat}}\) compared to \(\texttt {NA}_{\textrm{aug}}\) in Fig. 4. A first glance at Figs. 1 and 2 might suggest that larger suggested sizes in case of the latent projection should be more frequent than they are, but uncertainty needs to be taken into account, too: the bigger SEs in case of the latent projection (Fig. 3) may cause the latent projection to arrive at similar suggested sizes as the augmented-data projection, even if the latent-projection submodel MLPDs approach the reference model MLPD more slowly.

Fig. 4 Suggested submodel size based on the latent projection (\(G_{\textrm{lat}}\)) minus suggested submodel size based on the augmented-data projection (\(G_{\textrm{aug}}\)). In cases where projpred::suggest_size() was not able to suggest a size (e.g., because the forward search was terminated before the submodel MLPD could approach the reference model MLPD sufficiently), NA was returned (\(\texttt {NA}_{\textrm{aug}}\): NA for the augmented-data projection only; \(\texttt {NA}_{\textrm{lat}}\): NA for the latent projection only; \(\texttt {NA}_{\textrm{both}}\): NA for both projections)

The slower convergence towards the reference model MLPD in case of the latent projection is also visible in a slight left-skewness (with a peak around zero) of the distribution of \(\textrm{MLPD}_{\textrm{lat}} - \textrm{MLPD}_{\textrm{aug}}\) at submodel size \(G_{\textrm{min}} = \min (G_{\textrm{aug}}, G_{\textrm{lat}})\) (provided that at least one of \(G_{\textrm{aug}}\) and \(G_{\textrm{lat}}\) is non-NA) across all simulation iterations (Fig. 5).

Fig. 5 Predictive performance based on the latent projection minus predictive performance based on the augmented-data projection, at size \(G_{\textrm{min}} = \min (G_{\textrm{aug}}, G_{\textrm{lat}})\). The total of 97 instead of \(R = 100\) simulation iterations is caused by 3 iterations where both projection methods resulted in a suggested size of NA. The top x-axis is simply the \(\exp (\cdot )\) scale of the bottom x-axis

Finally, Fig. 6 shows the runtime of the projpred::varsel() call for both projection methods. Clearly, the augmented-data projection takes much longer (median runtime across all simulation iterations: ca. 14.6 min) than the latent projection (median runtime across all simulation iterations: ca. 1.5 min). This is the price to pay for the exact projection instead of the approximate latent projection.

Fig. 6 Runtime (in minutes) of projpred::varsel() based on the augmented-data and the latent projection, for all \(R = 100\) simulation iterations (represented by points and summarized by boxplots)

4 Example: Renal cell carcinoma subtyping

We illustrate the application of the augmented-data projection embedded in a PPVS for a nominal response variable using a cancer dataset from the Institute of Pathology of the Rostock University Medical Center (Germany). This dataset is available in Online Resource 1 and consists of those 285 observations (patients) with complete records from the larger dataset used by Zimpfer et al. (2019).

Zimpfer et al. (2019) conducted a retrospective study for renal cell carcinoma (RCC) subtyping in accordance with the 2016 WHO classification. RCC subtyping is of prognostic relevance for patients, so it is crucial to determine the subtype accurately. In Zimpfer et al. (2019), RCC subtyping was performed histologically by trained pathologists. Our data contain three RCC subtypes: clear-cell RCC (relative frequency: ca. \(86.0\,\%\)), papillary RCC (ca. \(9.5\,\%\)), and a set of rare (WHO-unclassified) subtypes (ca. \(4.6\,\%\)).

Despite the focus on determining the RCC subtype accurately, it is also helpful to predict the RCC subtype as early as possible during patient care. Thus, we apply a PPVS with the three-level RCC subtype as response. On the predictor side, our reference model consists of the main effects and all possible two-way interactions of the following seven predictor variables, which were chosen based on Table 2 of Zimpfer et al. (2019):

  • age: age at diagnosis (in years),

  • sex: sex ("female" or "male"),

  • grade: histologic tumor grade (coded as "G1G2" for grades G1–G2 and "G3G4" for G3–G4),

  • stage: histologic tumor stage (coded as "T1T2" for stages T1–T2 and "T3T4" for T3–T4),

  • nodes: nodal metastases spread nearby (coded as "no" for N0 and "yes" for N1),

  • metastases: metastases 0–6 months post-diagnosis (coded as "no" for M0 and "yes" for M1),

  • resection: classification of the resection margin (coded as "R0" for R0 and "R1R2" for R1–R2).

This gives \(7 + \left( {\begin{array}{c}7\\ 2\end{array}}\right) = 28\) predictors per response category (except for the reference category), a situation that—given the 285 observations available here—does not really belong to the “small \(n\), large \(p\)” regime mentioned by Piironen et al. (2020). However, the “small \(n\), large \(p\)” regime is merely the regime the PPVS was originally applied to; in practice, the PPVS may just as well be applied to “small \(n\), small \(p\)” or other regimes.

In the following, we only describe modeling choices deviating from the defaults of the respective R function arguments. For details on the whole procedure, see the R code provided in Online Resource 1.

For fitting the reference model, we use the brms::categorical() response family from the R package brms. For the regression coefficients, we choose the R2-D2 prior (Zhang et al. 2022) as implemented in brms. In case of the brms::categorical() family, the R2-D2 prior’s \(R^2\) parameter does not have an intuitive interpretation (in contrast to normal linear models), but smaller \(R^2\) values still imply a stronger penalization. Here, we choose a mean of \(0.4\) and a pseudo-precision parameter of \(2.5\) for the Beta prior on \(R^2\), i.e., slightly more penalization than implied by the default uniform Beta prior. The current implementation of the R2-D2 prior in brms requires the predictors to be on a comparable scale (except if differing scales carry meaning with respect to the relevance of the predictors, in the sense that predictors with a larger scale should be more relevant, which we do not assume here). Thus, as suggested by Gelman et al. (2008), we scale the only continuous predictor variable age to a standard deviation of \(0.5\) (which corresponds to the standard deviation of a binary predictor with a relative frequency of \(50\,\%\) for both categories). Prior to scaling age, we center it to a mean of \(0\).
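A hedged sketch of this reference model fit; the dataset and variable names (rcc, subtype) as well as the dpar names (which depend on the response's factor levels) are hypothetical:

```r
library(brms)
## Center age, then scale it to a standard deviation of 0.5:
rcc$age <- 0.5 * as.vector(scale(rcc$age))

ref_fit <- brm(
  subtype ~ (age + sex + grade + stage + nodes + metastases +
               resection)^2,  # main effects and all two-way interactions
  data = rcc, family = categorical(),
  ## R2-D2 prior with mean 0.4 and pseudo-precision 2.5, set for each
  ## non-reference category's linear predictor (dpar names hypothetical):
  prior = c(
    set_prior(R2D2(mean_R2 = 0.4, prec_R2 = 2.5), class = "b",
              dpar = "mupapillary"),
    set_prior(R2D2(mean_R2 = 0.4, prec_R2 = 2.5), class = "b",
              dpar = "murare")
  ),
  backend = "cmdstanr"
)
```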

The Markov chains of the brms reference model fit appear to have converged: all checks that we already performed in the simulation study (Sect. 3.1) are passed. Furthermore, we conduct some basic checks for the reference model to be appropriate from a predictive point of view. These checks (not shown here, but reported in Online Resource 1) reveal that the reference model’s predictions are largely driven by the intercepts. (In a brms::categorical() model, the intercepts transformed to response scale—i.e., to probabilities—reflect the hypothetical frequencies of the response categories at predictor values of zero.) In this sense, the reference model (or rather the data it is based upon) is suboptimal, but still sufficient for illustrative purposes.

Within projpred, we perform the PPVS using a \(K\)-fold cross-validation (\(K\)-fold CV), here with \(K = 30\). Based on a preliminary projpred::cv_varsel() run with Pareto-smoothed importance sampling leave-one-out CV (PSIS-LOO CV; Vehtari et al. 2017, 2022) and a full-data search (i.e., a search that was not run separately for each CV fold), we restrict the maximum submodel size for the fold-wise searches in the final projpred::cv_varsel() run (the \(K\)-fold one) to \(3\), thereby saving computational resources.
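A hedged sketch of this final variable selection run (argument names as in projpred 2.4.0; ref_fit is the hypothetical reference model fit from above):

```r
library(projpred)
## Final run: 30-fold CV, forward search, and the maximum submodel size
## reduced to 3 based on the preliminary PSIS-LOO-CV run:
cvvs <- cv_varsel(ref_fit, cv_method = "kfold", K = 30,
                  method = "forward", nterms_max = 3)
plot(cvvs, stats = "mlpd", deltas = TRUE)  # cf. Fig. 7
```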

The whole projpred part of our code takes approximately 15 min on a standard desktop machine. The final projpred::cv_varsel() run yields the predictive performance plot depicted in Fig. 7.

Fig. 7 Relative predictive performance for increasing submodel sizes in the RCC example from Sect. 4. Here, “relative” means that the left y-axis shows \(\Delta \textrm{MLPD} = \textrm{MLPD} - \textrm{MLPD}^{*}\) (with \(\textrm{MLPD}^{*}\) denoting the reference model MLPD). The right y-axis is simply the \(\exp (\cdot )\) scale, i.e., it shows \(\textrm{GMPD} / \textrm{GMPD}^{*}\) (with \(\textrm{GMPD}^{*}\) denoting the reference model GMPD). The vertical uncertainty bars indicate \(\textrm{SE}(\Delta \textrm{MLPD})\) (one such SE to either side of the point estimate)

Based on Fig. 7, we choose a submodel size of \(2\). The heuristic implemented in projpred::suggest_size() would have given a size of \(1\) (because size \(1\) is the smallest size where the submodel MLPD point estimate is less than one standard error below the reference model MLPD point estimate). Here, we choose the slightly bigger size of \(2\) due to the medical context, where the primary goal is predictive accuracy, with sparsity being only a secondary goal.

The summary of the fold-wise predictor rankings presented in Table 1 shows that all \(K = 30\) CV folds agree on the first two predictors: metastases and nodes (in this order). Thus, our selected submodel consists of these two predictors. After a final projection of the reference model onto this submodel (this time using the draw-by-draw method, i.e., projecting each posterior draw from the reference model onto the submodel parameter space without any clustering), we can make predictions with this submodel. These predictions are presented in Table 2 (this compact form is possible here because there are only two binary predictors). Although the absolute changes in the predictive probabilities might at first seem quite large (up to about \(22\,\%\) when changing only one predictor at a time, and up to about \(27\,\%\) when changing both predictors simultaneously), the predictive probabilities are still dominated by the empirical frequencies of the response categories in the data and thus by the intercepts. As mentioned above, this was already observable in the reference model, so it is no surprise that the pattern reappears here: model selection cannot be expected to yield a model with better predictions than the reference model (Vehtari and Ojanen 2012; Piironen and Vehtari 2017a), especially in the context of projections which are essentially fitting to the fit of the reference model.
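A hedged sketch of this final projection and the predictions underlying Table 2 (argument names as in projpred 2.4.0, where the predictor terms are passed via solution_terms; depending on the projpred version, newdata may additionally need to contain a response column):

```r
## Draw-by-draw projection onto the selected submodel (ndraws set to the
## total number of posterior draws, so no clustering is involved):
prj <- project(ref_fit, solution_terms = c("metastases", "nodes"),
               ndraws = 4000)
## All four predictor combinations summarized in Table 2:
newdat <- expand.grid(metastases = c("no", "yes"), nodes = c("no", "yes"))
## Response-scale (probability) predictions:
prj_pred <- proj_linpred(prj, newdata = newdat, transform = TRUE)
```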

Table 1 Summary of the fold-wise predictor rankings
Table 2 All possible predictions from the final submodel

5 Discussion

We have presented how the projective part of the PPVS can be performed in case of a discrete response family with finite support. This augmented-data projection has been implemented as an extension of the projpred R package.

Apart from the presentation of the methodology, the purpose of this paper was to compare the augmented-data projection to the latent projection, an alternative projection method that is far more general than the augmented-data projection and covers many discrete finite-support response families as well. The simulation study we have conducted to this end demonstrated that most of the time, the two projection methods behave quite similarly in terms of predictive performance and the submodel size found by the projpred::suggest_size() heuristic. In some cases, the augmented-data projection yields a better predictive performance and (although not necessarily in the same cases) a smaller suggested size than the latent projection. In even less frequent cases, it is the latent projection which yields a better predictive performance and a smaller suggested size.

Overall (i.e., across all simulation iterations), the predictive performance of the submodels and the variable selection based upon it seem to be more stable in case of the augmented-data projection. This is probably due to the exact nature of the augmented-data projection, as opposed to the approximate nature of the latent projection. For example, in case of the ordinal family used here, one reason for the worse stability of the latent projection could be that it uses the reference model’s draws of the threshold parameters to compute response-scale output (such as the response-scale MLPD) for a submodel: in general, the smaller the submodel size, the larger the lack of fit between the latent predictor of a submodel and the latent predictor of the reference model will be. When using an ad-hoc solution for computing (response-scale) predictive probabilities by relying on the reference model’s thresholds, a lack of fit in the latent predictor causes the predictive probabilities of a submodel to become suboptimal without the projection noticing this (and thus without the possibility for the projection to adjust the regression coefficients). In contrast, the augmented-data projection aims at directly reproducing the predictive probabilities of the reference model, adjusting both the regression coefficients and the thresholds of a submodel. In principle, the latent projection also allows the predictive performance statistic(s) and other post-projection quantities to be calculated on the latent scale. By converting the results from the augmented-data projection to the latent scale as well, we could have tried to compare the augmented-data and the latent projection on the latent scale. However, in settings like ours where there is an independent test dataset (and the same applies to \(K\)-fold CV), it is not straightforward to define how the latent-scale predictions for the test dataset should be calculated (using the reference model fit based on the training data would induce a dependency between training and test data). Furthermore, latent-scale performance statistics like the latent-scale MLPD are not easily interpretable. Hence, we did not perform latent-scale analyses in our simulation study.

The MLPD was the only predictive performance statistic in our simulation study. In principle, the classification accuracy could be used as an alternative performance statistic in discrete finite-support observation models. However, especially in case of a moderate to large number of response categories (like the \(J = 5\) categories in our simulation study), this comes with a loss of information that the MLPD does not exhibit: for example, if the true response category of an observation is category \(3\) (out of \(5\)) and a model gives a predictive probability of \(23\,\%\) for category \(3\), a predictive probability of \(24\,\%\) for category \(4\), and predictive probabilities smaller than \(23\,\%\) for all other categories, then the prediction of the highest-probability category would lead to a misclassification in the zero–one utility spirit of the classification accuracy. The MLPD is smoother in the sense that the log predictive probability of that observation is \(\log (0.23)\), which would not differ much from the log predictive probability of \(\log (0.24)\) in a situation where the predictive probabilities for categories \(3\) and \(4\) were reversed. In any case, even if the accuracy may be considered appropriate in some use cases (after all, the choice of performance statistic is application-specific), we do not expect our main conclusions to change substantially with alternative performance statistics.
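To put the toy example above in numbers (a trivially small log-score difference versus a full flip of the zero-one score):

```r
## Swapping the predictive probabilities of categories 3 and 4 changes
## the observation's log predictive probability only slightly ...
log(0.24) - log(0.23)  # ~ 0.043
## ... whereas the classification accuracy for this observation flips
## from 0 (misclassified) to 1 (correct).
```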

The cost of the augmented-data projection’s higher stability is a considerable increase in runtime. Because of this, it might be helpful to use the latent projection for preliminary results in the model-building workflow and to use the augmented-data projection afterwards for final results. One particular purpose of a preliminary latent-projection run could be to find a reasonable value for argument nterms_max of projpred::varsel() or projpred::cv_varsel() (this argument determines up to which submodel size the search should be conducted) because often, nterms_max can be chosen smaller than the value implied by the default heuristic, which reduces the runtime for the final augmented-data projection significantly.
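A hedged sketch of this two-stage workflow (whether latent = TRUE is passed via get_refmodel() or directly to the search function may depend on the projpred version; the object names and the chosen nterms_max value are hypothetical):

```r
library(projpred)
## Stage 1: fast latent-projection run for orientation.
refm_lat <- get_refmodel(ref_fit, latent = TRUE)
vs_lat <- varsel(refm_lat, method = "forward")
plot(vs_lat, stats = "mlpd", deltas = TRUE)  # pick a plausible size
## Stage 2: final augmented-data run with a reduced nterms_max.
vs_aug <- varsel(ref_fit, method = "forward", nterms_max = 5)
```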

The only number of response categories investigated in our simulation study was \(J = 5\), which may be regarded as overly restrictive. In Appendix C (Online Resource 2), we show how our results would have changed in case of \(J = 3\) and \(J = 7\). Most importantly, our main conclusion (the recommendation of using the latent projection for preliminary results and the augmented-data projection for final results) would not have changed.

An advantage of the augmented-data projection that was briefly mentioned in Sect. 2.3 and later illustrated in the example from Sect. 4 is the support for nominal families (like brms::categorical()) which rely on more than one latent predictor per observation (typically, nominal families come with one observation-wise latent predictor for each response category except for a reference category). So far, such families are not supported by the latent projection.

It is worth mentioning that just like the traditional PPVS, the PPVS based on the augmented-data projection is—by construction—sensitive to the choice of reference model. For the traditional PPVS, this is demonstrated, e.g., by McLatchie et al. (2023). For our cancer subtyping example from Sect. 4, this is demonstrated in Appendix D (Online Resource 2).

In the future (and if requested by users), the implementation of the augmented-data projection in projpred can be extended to more exotic discrete finite-support response families in a straightforward manner (see Sect. 2.3).

Furthermore, the augmented-data projection might also be applicable to continuous response families and discrete families with infinite support, using either a Monte Carlo or a discretization approach for achieving an artificial support that is discrete and finite. The Monte Carlo approach might require a clustering or some other kind of grouping of the response draws to arrive at a practicable number of response categories. For the discretization approach, it might be possible to borrow ideas from Röver and Friede (2017).

Finally, we note that the augmented-data projection in projpred also supports multilevel models. Since the PPVS for multilevel models (in general) is currently subject to more detailed investigations, we leave the comparison of augmented-data and latent projection for multilevel models for future research.

6 Supplementary information

The following Online Resources are available for this article:

Online Resource 1: Example files. The files (dataset, code, and additional output) for the example from Sect. 4.

Online Resource 2: Appendices. A single document containing multiple appendices with further simulation results and a sensitivity analysis for the reference model from Sect. 4.