1 Introduction

The projection predictive variable selection (PPVS; Piironen et al. 2020; Catalina et al. 2022) is a special predictive model selection method (Vehtari and Ojanen 2012) for Bayesian regression models that comes with valid post-selection inference (disregarding the selection of the final model size) and has been shown to perform better—in general—than alternative methods (Piironen and Vehtari 2017a). It is based on the Bayesian decision-theoretical variable selection framework by Lindley (1968) and the practical draw-by-draw Kullback–Leibler (KL) projection proposed by Goutis and Robert (1998) and Dupuis and Robert (2003). So far, the implementation of the PPVS in the R (R Core Team 2023) package projpred (Piironen et al. 2023) has been restricted to the Gaussian, the binomial, and the Poisson response families. Recently, the latent projection (Catalina et al. 2021) has extended the range of possible response families considerably, in particular to ordinal families relying on a single latent predictor (per observation), an example being the cumulative ordinal family from MASS::polr() (Venables and Ripley 2002). However, the latent projection is an approximate approach as it replaces the original projection problem with a latent projection problem. Here (Sect. 2), we present the exact solution to the original projection problem for discrete finite-support response families and call the corresponding procedure the augmented-data projection.

For investigating the performance of the augmented-data projection (Sect. 3), we confine ourselves to a simulation study comparing the augmented-data projection to the latent projection because the generally superior performance of the PPVS based on the traditional projection and based on the latent projection has already been demonstrated by Piironen and Vehtari (2017a) and Catalina et al. (2021), respectively.

We illustrate the application of the augmented-data projection in Sect. 4 with the help of a real-world example, thereby demonstrating another benefit of the augmented-data projection: its support for discrete finite-support response families with more than one latent predictor per observation (response families which are not supported by the latent projection).

Finally, our work is discussed in Sect. 5, where we also mention possible modifications of the augmented-data projection to extend it to more response families in the future.

2 Augmented-data projection

2.1 Notation

For the following mathematical presentation of the augmented-data projection (a special case of the general approach that is presented first), we assume the availability of a dataset with \(N\) observations. The observed response vector will be denoted by \(\varvec{y} = (y_1, \dotsc , y_N)^{\scriptscriptstyle \textsf{T}} \in \mathcal {Y}^{N} \subseteq \mathbb {R}^{N}\). We do not introduce any notation for the corresponding predictor data as we will always be conditioning implicitly on it. By \(\varvec{\tilde{y}} = (\tilde{y}_1, \dotsc , \tilde{y}_N)^{\scriptscriptstyle \textsf{T}}\), we will denote unobserved response values at the same observed predictor values, with realizations in \(\mathcal {Y}^{N}\).

A crucial factor (Piironen et al. 2020; Pavone et al. 2022) behind the superior performance of the PPVS is the reference model, which is the best possible model (in terms of predictive performance) one can construct. For projpred, the reference model is usually fitted within rstanarm (Goodrich et al. 2023) or brms (Bürkner 2017, 2018), which both rely on Stan (Carpenter et al. 2017; Stan Development Team 2022b), a probabilistic programming language and software that is mainly used for its dynamic Hamiltonian Monte Carlo (HMC) algorithm, a modern Markov chain Monte Carlo (MCMC) sampler. However, the methodology behind projpred is more general and does not require the reference model to be fitted within rstanarm or brms. Thus, we start by assuming that we have \(S^{*}\) draws \(\varvec{\theta }^{*}_{s} \in \varvec{\Theta }^{*}\) (\(s \in \{1, \dotsc , S^{*}\}\)) from the reference model’s posterior distribution, with \(\varvec{\Theta }^{*}\) denoting the reference model’s parameter space. Furthermore, for speeding up the computations, the PPVS assumes that these \(S^{*}\) posterior draws have been clustered or thinned so that \(\{1, \dotsc , S^{*}\} \supseteq \mathop {\bigcup }_{c = 1}^{C} \mathcal {I}^{*}_{c}\) with disjoint (and non-empty) index sets \(\mathcal {I}^{*}_{c}\). Important special cases are the single-cluster (\(C = 1\)) and the draw-by-draw (\(C = S^{*}\)) projection. If clustering (not thinning) is chosen, it is the intermediate case of \(C \in \{2, \dots , S^{*} - 1\}\) which requires an actual clustering of the posterior draws (an explanation of how this clustering is performed in projpred is given below). Based on the clustering (or thinning), we can restrict the reference model’s full posterior predictive distribution (for observation \(i \in \{1, \dots , N\}\)) to the \(c\)-th cluster of posterior draws (or the \(c\)-th thinned draw; \(c \in \{1, \dots , C\}\)):

$$\begin{aligned} p(\tilde{y}_i|\mathcal {I}^{*}_{c}) = \frac{1}{|\mathcal {I}^{*}_{c}|} \sum _{s \in \mathcal {I}^{*}_{c}} p(\tilde{y}_i|\varvec{\theta }^{*}_{s}). \end{aligned}$$

In doing so, the conditioning on an index set slightly abuses notation, but we think it improves readability while at the same time reflecting the basic idea behind this empirical average. Note that by conditioning on \(\varvec{\theta }^{*}_{s}\), \(p(\tilde{y}_i|\varvec{\theta }^{*}_{s})\) implicitly conditions on the observed data (\(\varvec{y}\)) and hence \(p(\tilde{y}_i|\mathcal {I}^{*}_{c})\) conditions on \(\varvec{y}\) as well. Expectations with respect to \(p(\tilde{y}_i|\mathcal {I}^{*}_{c})\) will be denoted by \(\mathbb {E}(\cdot |\mathcal {I}^{*}_{c})\).
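For a discrete finite-support response family (the focus of Sect. 2.3), this empirical average is a simple column mean over the draw-wise predictive probabilities. A minimal sketch in R, using our notation rather than projpred's internals (the objects draw_probs and I_c are our assumptions):

```r
## draw_probs: S* x J matrix for a fixed observation i, with row s holding
##             p(ytilde_i = j | theta*_s) for the J possible response values.
## I_c:        integer vector containing the draw indices of cluster c.
p_c <- colMeans(draw_probs[I_c, , drop = FALSE])  # p(ytilde_i = . | I*_c)
```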

A model selection problem comes with several candidate models, of which we will consider only a single one here, to avoid cluttering notation. In the context of a variable selection problem, this candidate model may also be called a submodel of the full model which includes all predictors. The parameter space of this representative submodel will be denoted by \(\varvec{\Theta }\) and its parameter-conditional predictive distribution (i.e., its likelihood when regarded as a function of the parameters) by \(p(\tilde{y}_i|\varvec{\theta })\) (for \(\varvec{\theta } \in \varvec{\Theta }\)). We emphasize that in general, \(\varvec{\Theta }\) does not have to be related to \(\varvec{\Theta }^{*}\) in any form (in particular, it does not have to be a restricted subspace).

Finally, we need the Kullback–Leibler (KL) divergence (Kullback and Leibler 1951) from a distribution \(p(x)\) to a distribution \(q(x)\):

$$\begin{aligned} D_{\textrm{KL}}\!\left( p(x) \,\Vert \, q(x) \right) = \mathbb {E}_{p(x)}\!\left( \log \frac{p(x)}{q(x)}\right) \end{aligned}$$

where we have added the subscript \(p(x)\) to clarify the distribution that the expectation refers to.

For the clustering (and several other steps), projpred requires an invertible link function \(g\). With this link function \(g\), projpred performs the clustering of the \(S^{*}\) posterior draws by applying stats::kmeans() (the stats package is part of R) to the \(S^{*}\) length-\(N\) vectors \(\varvec{g}(\mathbb {E}(\varvec{\tilde{y}}|\varvec{\theta }^{*}_{s}))\) where \(\varvec{g}\) denotes the vectorized link function, i.e., the function which applies \(g\) to each element of a vector.
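As a hedged sketch of this clustering step (the matrix mu_star, the vectorized link g_vec, and the number of clusters are our assumptions, not projpred's actual internals):

```r
## mu_star: S* x N matrix with row s holding E(ytilde | theta*_s);
## g_vec:   the vectorized link function, e.g., qlogis for the logit link.
g_vec <- qlogis
C <- 20  # hypothetical number of clusters
km <- stats::kmeans(g_vec(mu_star), centers = C)
## The index sets I*_c are then recovered from the cluster assignments:
I_c_list <- split(seq_len(nrow(mu_star)), km$cluster)
```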

2.2 General approach

In general, the submodel’s projected parameter values for cluster (or thinned draw) \(c \in \{1, \dotsc , C\}\) are obtained by solving

$$\begin{aligned} \varvec{\theta }_{c} = \;&\mathop {\textrm{argmin}}\limits _{\varvec{\theta } \in \varvec{\Theta }} \frac{1}{N} \sum _{i = 1}^{N} D_{\textrm{KL}}\!\left( p(\tilde{y}_i|\mathcal {I}^{*}_{c}) \,\Vert \, p(\tilde{y}_i|\varvec{\theta }) \right) \nonumber \\ = \;&\mathop {\textrm{argmax}}\limits _{\varvec{\theta } \in \varvec{\Theta }} \sum _{i = 1}^{N} \mathbb {E}\!\left( \log p(\tilde{y}_i|\varvec{\theta }) \,|\, \mathcal {I}^{*}_{c}\right) , \end{aligned}$$
(1)

see Piironen et al. (2020).

This projection problem is not easy to solve in general because \(\mathbb {E}(\cdot |\mathcal {I}^{*}_{c})\) is an expectation with respect to \(p(\tilde{y}_i|\mathcal {I}^{*}_{c})\). Equation (1) simplifies considerably if the submodel’s response family follows the definition from McCullagh and Nelder (1989, equation (2.4)) because in that case, \(\log p(\tilde{y}_i|\varvec{\theta })\) is linear in \(\tilde{y}_i\), at least for optimization with respect to the non-dispersion parameters. Another simplifying case is \(|\mathcal {Y}| < \infty\), which is the case of interest here (see Sect. 2.3).
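For completeness, a brief check of the McCullagh–Nelder case: writing the density from their equation (2.4) with natural parameter \(\vartheta _i\) (a function of \(\varvec{\theta }\)) and dispersion parameter \(\phi\), we have

$$\begin{aligned} \log p(\tilde{y}_i|\varvec{\theta }) = \frac{\tilde{y}_i \vartheta _i - b(\vartheta _i)}{a(\phi )} + c(\tilde{y}_i, \phi ), \end{aligned}$$

so that \(\mathbb {E}\!\left( \log p(\tilde{y}_i|\varvec{\theta }) \,|\, \mathcal {I}^{*}_{c}\right) = \frac{\mathbb {E}(\tilde{y}_i|\mathcal {I}^{*}_{c}) \, \vartheta _i - b(\vartheta _i)}{a(\phi )} + \mathbb {E}\!\left( c(\tilde{y}_i, \phi ) \,|\, \mathcal {I}^{*}_{c}\right)\). The last term does not depend on the non-dispersion parameters, so for those parameters, the objective in Eq. (1) only requires the cluster-wise means \(\mathbb {E}(\tilde{y}_i|\mathcal {I}^{*}_{c})\).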

2.3 Discrete finite-support response families

In case of \(|\mathcal {Y}| < \infty\), Eq. (1) simplifies because \(\mathbb {E}(\cdot |\mathcal {I}^{*}_{c})\) is then a sum over all possible response values:

$$\begin{aligned} \varvec{\theta }_{c} = \;&\mathop {\textrm{argmax}}\limits _{\varvec{\theta } \in \varvec{\Theta }} \sum _{i = 1}^{N} \sum _{\tilde{y} \in \mathcal {Y}} a^{*}_{c, i, \tilde{y}} \log p(\tilde{y}_i = \tilde{y}|\varvec{\theta }) \end{aligned}$$
(2)

with \(a^{*}_{c, i, \tilde{y}} = p(\tilde{y}_i = \tilde{y}|\mathcal {I}^{*}_{c})\). Equation (2) is simply a weighted maximum-likelihood (ML) problem when using an augmented dataset where each observation is repeated \(|\mathcal {Y}|\) times and the response value is set to each possible value \(\tilde{y} \in \mathcal {Y}\) in turn, so that the resulting augmented dataset has a total of \(N \cdot |\mathcal {Y}|\) rows. This approach is what we call the augmented-data projection, although in the projpred implementation, the augmented dataset is constructed internally as \(|\mathcal {Y}|\) blocks of \(N\) rows instead of the other way round.

Equation (2) shows that the augmented-data projection consists of fitting to the fit of the reference model, a fundamental property already exhibited by the traditional projection (Piironen et al. 2020). In case of a discrete response family with finite support, the fit of the reference model just needs to be expressed differently, namely in terms of probabilities for all of the response categories, and fitting to that fit then needs to be done in a weighted fashion.

Due to the augmented-data projection being a weighted ML problem, the basic idea for implementing it in projpred is simply to apply existing R functions capable of performing a weighted ML estimation (e.g., MASS::polr() in case of the commonly used cumulative ordinal family which is described in detail in Sect. 3.1) to the augmented dataset. Currently, projpred’s augmented-data projection adds support for the brms::cumulative() family (encoding the cumulative ordinal family), for rstanarm::stan_polr() fits (employing the cumulative ordinal family), and for the brms::categorical() family (encoding the nominal family used in multinomial logistic regression). We emphasize that these families refer to the submodels, not to the reference model. Typically, the reference model has the same response family as the submodels, but in general, the reference model is allowed to have a different family. In case of the augmented-data projection, the only requirement concerning the form of the reference model is that its response family must be discrete and have finite support (otherwise, the step from Eq. (1) to Eq. (2) would be incorrect). In theory, Eq. (2) does not require the submodel to have a discrete finite-support response family, but typically—and especially with respect to the implementation in projpred—this requirement makes sense.
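To make the weighted ML idea concrete, here is a hedged sketch for the cumulative ordinal case; the objects ref_probs and dat as well as the predictor names are our assumptions, not projpred's internals (which additionally loop over the clusters \(c\)):

```r
## ref_probs: N x J matrix with ref_probs[i, j] = p(ytilde_i = j | I*_c)
##            for one cluster c of the reference model's posterior draws.
## dat:       data frame with the submodel's predictors (columns x1, x2).
library(MASS)
J <- ncol(ref_probs)

## Augmented dataset with J blocks of N rows: in block j, the response is
## set to category j and weighted by the reference model's probability.
aug <- do.call(rbind, lapply(seq_len(J), function(j) {
  data.frame(dat,
             y_aug = factor(rep(j, nrow(dat)), levels = seq_len(J),
                            ordered = TRUE),
             w_aug = ref_probs[, j])
}))

## The weighted ML fit solves the projection problem (2) for cluster c:
prj_c <- polr(y_aug ~ x1 + x2, data = aug, weights = w_aug,
              method = "probit")
```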

The augmented-data projection has been added in version 2.4.0 of projpred (Piironen et al. 2023). In that version, an updated implementation of the latent projection (compared to Catalina et al. 2021) has been included as well. Note that for applying both—the augmented-data projection and the updated implementation of the latent projection—to reference model fits from brms, version 2.19.0 (or later) of brms is needed.

3 Simulation study

For the following simulation study comparing augmented-data and latent projection, we assume that the reader is familiar with the typical projpred workflow, as presented in the main vignette of the projpred package, for example.

3.1 Setup

Since the brms::categorical() family (see Sect. 2.3) comes with more than one latent predictor per observation (if \(|\mathcal {Y}| > 2\), which is the relevant case), it is not supported by the latent projection. Hence, our simulation study is restricted to the brms::cumulative() family (which encodes the same observation model as in rstanarm::stan_polr() fits, see Sect. 2.3).

More specifically, to comply with Catalina et al. (2021), we use \(J = |\mathcal {Y}| = 5\) response categories and the probit link function \(g = \Phi ^{-1}\) (the quantile function of the standard normal distribution). The number of observations is set to \(N = 100\), in accordance with the value used throughout the main article of Catalina et al. (2021).

Then, for each of \(R = 100\) simulation iterations, the simulation study involves the following steps:

  1.

    Define the \(J - 1\) latent thresholds (intercepts) \(\zeta _{j}\) (\(j \in \{1, \dotsc , J - 1\}\)) as

    $$\begin{aligned} \zeta _{j} = g\!\left( \frac{j}{J}\right) . \end{aligned}$$
  2.

    Generate \(P = 50\) regression coefficients \(\beta _{p}\) (\(p \in \{1, \dotsc , P\}\)) according to a regularized horseshoe prior (Piironen and Vehtari 2017c). The underlying mechanism may be found in the R code for this simulation study (see the link at the end of this section). Here, we choose a global scale parameter of

    $$\begin{aligned} \tau _{0} = \frac{p_{0}}{P - p_{0}} \cdot \frac{\tilde{\sigma }}{\sqrt{N}} \end{aligned}$$

    with \(p_{0} = 10\) and

    $$\begin{aligned} \tilde{\sigma }^2 = \exp \!\left( \frac{1}{J} \sum _{j = 1}^{J} \log \tilde{\sigma }_{j}^2\right) = \root J \of {\prod _{j = 1}^{J} \tilde{\sigma }_{j}^2}, \end{aligned}$$

    where \(\tilde{\sigma }_{j}^2\) is calculated according to Section 3.5 of Piironen and Vehtari (2017c), taking the same thresholds \(\zeta _{j}\) as defined above and assuming a typical data point with a latent predictor of zero so that all response categories are equally likely (in analogy to the approach of Piironen and Vehtari (2017b), in case of the binomial family with the logit link). Here, we obtain an overall pseudo variance of \(\tilde{\sigma }^2 \approx 1.06^2\). For the Student-\(t\) slab of the regularized horseshoe prior, we choose \(100\) degrees of freedom (effectively yielding a Gaussian slab) and a scale parameter of \(1\).

  3.

    Generate a training dataset according to the following data-generating model, where \(i \in \{1, \dots , N\}\) (an R sketch of steps 1 and 3 is given after this list):

    $$\begin{aligned}&x_{i, p} \sim \mathcal {N}(0, 1) \quad (p \in \{1, \dots , P\}),\\&\eta _i = \sum _{p = 1}^{P} \beta _{p} x_{i, p},\\&\varvec{\zeta } = (\zeta _{1}, \dotsc , \zeta _{J - 1})^{\scriptscriptstyle \textsf{T}},\\&y_i \sim \textrm{Cumul}(\varvec{\zeta }, \eta _i), \end{aligned}$$

    where \(\mathcal {N}(\mu , \sigma )\) denotes a normal distribution with mean \(\mu\) and standard deviation \(\sigma\) and \(\textrm{Cumul}(\varvec{\zeta }, \eta _i)\) denotes the distribution with probability mass function

    $$\begin{aligned} p(y_i = j|\varvec{\zeta }, \eta _i) = g^{-1}(\zeta _{j} - \eta _i) - g^{-1}(\zeta _{j - 1} - \eta _i) \end{aligned}$$

    for \(j \in \{1, \dots , J\} = \mathcal {Y}\), exploiting auxiliary elements \(\zeta _{0} = -\infty\) and \(\zeta _{J} = \infty\) and defining \(g^{-1}(-\infty ) = 0\) and \(g^{-1}(\infty ) = 1\) (as well as \(\pm \infty - b = \pm \infty\) for \(b \in \mathbb {R}\)).

  4.

    Generate an independent test dataset using the same data-generating model and the same settings (in particular, the same number \(N\) of observations) as for the training data.

  5.

    Fit a reference model to the training data, using the data-generating model as the data-fitting model, except that the prior for the thresholds \(\zeta _{j}\) (\(j \in \{1, \dotsc , J - 1\}\)) is set to \(\mathcal {N}(0, 2.5)\) in the data-fitting model. The reference model fit is performed by brms::brm(), using the cmdstanr (Gabry and Češnovar 2022) backend. We use the default of \(4\) Markov chains, each running \(1000\) warmup and \(1000\) post-warmup iterations. In order to avoid spurious divergences of Stan’s dynamic HMC sampler, we aim at smaller step sizes by setting adapt_delta = 0.99. By specifying init = 1, we narrow down the range that the initial parameter values are randomly drawn from (this was necessary to prevent chains from occasionally initializing in an area of the parameter space with log posterior density numerically equal to \(-\infty\), or from running into such an area shortly after initialization). We checked the convergence of the Markov chains for an initial reference model fit (based on a dataset independent of those from the \(R = 100\) simulation iterations) with the help of common MCMC diagnostics (Betancourt 2018; Vehtari et al. 2021; Stan Development Team 2022a; Bürkner et al. 2023).

  6.

    Run the following projpred steps twice (once with the augmented-data projection and once with the latent projection, but based on the same training and test data and based on the same reference model fit):

    (a)

      Run projpred::varsel(), which consists of two main steps: a heuristic search and the calculation of precursor quantities required for estimating predictive performance statistics along the predictor ranking returned by the search (see the next enumeration point, 6(b)). In doing so, we pass the test data via argument d_test. As search method, we choose the forward search because projpred’s augmented-data projection currently does not support the L1 search and also because the L1 search is often less accurate. Apart from that, we leave all other arguments at their defaults.

    (b)

      For each submodel size along the predictor ranking: retrieve the mean log predictive density (MLPD; actually mean log predictive probability, but the same acronym is used for simplicity), \(\Delta \textrm{MLPD} = \textrm{MLPD} - \textrm{MLPD}^{*}\) (with \(\textrm{MLPD}^{*}\) denoting the reference model MLPD), and the corresponding standard errors (SEs). This is achieved via projpred:::summary.vsel() applied to the output of projpred::varsel() from the previous enumeration point (6(a)), once with deltas = FALSE (for the MLPD) and once with deltas = TRUE (for \(\Delta \textrm{MLPD}\)). Here, the MLPD is the chosen performance statistic because of the desirable properties of the log score in general (Vehtari and Ojanen 2012) and because \(\exp (\textrm{MLPD})\), the geometric mean predictive density (GMPD), has an interpretable scale of \((0, 1)\) in case of a discrete response family. We denote the MLPD based on the augmented-data projection by \(\textrm{MLPD}_{\textrm{aug}}\) and the corresponding \(\Delta \textrm{MLPD}\) value by \(\Delta \textrm{MLPD}_{\textrm{aug}}\). For the latent projection, these are denoted by \(\textrm{MLPD}_{\textrm{lat}}\) and \(\Delta \textrm{MLPD}_{\textrm{lat}}\), respectively.

    (c)

      Suggest a submodel size via projpred::suggest_size(). As underlying performance statistic, we choose the MLPD again, for consistency with the results retrieved from projpred:::summary.vsel(). We denote the suggested size based on the augmented-data projection by \(G_{\textrm{aug}}\) and the suggested size based on the latent projection by \(G_{\textrm{lat}}\). Since we use the default arguments of projpred::suggest_size() (apart from the predictive performance statistic which is set to "mlpd"), \(G_{\textrm{aug}}\) is given by the smallest submodel size for which \(\Delta \textrm{MLPD}_{\textrm{aug}} + \textrm{SE}(\Delta \textrm{MLPD}_{\textrm{aug}}) \ge 0\) (and \(G_{\textrm{lat}}\) is defined analogously). If there is no submodel size fulfilling this criterion, NA is returned.
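As referenced in step 3, the following hedged R sketch illustrates steps 1 and 3 (threshold definition and training data generation); the coefficient vector beta from step 2 is assumed to exist:

```r
N <- 100; P <- 50; J <- 5
g_inv <- pnorm                   # inverse probit link
zeta <- qnorm((1:(J - 1)) / J)   # thresholds zeta_j = g(j / J)

x <- matrix(rnorm(N * P), nrow = N)  # x_{i,p} ~ N(0, 1)
eta <- as.vector(x %*% beta)         # latent predictors eta_i

## Cumulative probit probabilities:
## p(y_i = j) = g^{-1}(zeta_j - eta_i) - g^{-1}(zeta_{j-1} - eta_i)
cum <- cbind(0, sapply(zeta, function(z) g_inv(z - eta)), 1)
probs <- cum[, -1] - cum[, -(J + 1)]
y <- apply(probs, 1, function(p) sample.int(J, size = 1, prob = p))
```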

The R code for this simulation study is available on GitHub. Figures were created with ggplot2 (Wickham 2016).

3.2 Results

A central part of the projpred workflow is the plot of the chosen performance statistic (relative to the reference model’s performance) in dependence of the submodel size. This is essentially what Fig. 1 shows, slightly adapted to a simulation study: the lines from all \(R = 100\) simulation iterations are combined into one plot for the augmented-data and the latent projection, respectively. To avoid an overly crowded plot, the uncertainty bars that are otherwise part of this plot have been omitted.

Fig. 1 Relative predictive performance at increasing submodel sizes for the augmented-data projection (top) and the latent projection (bottom) in all \(R = 100\) simulation iterations. Here, “relative” means that the left y-axis shows \(\Delta \textrm{MLPD} = \textrm{MLPD} - \textrm{MLPD}^{*}\) (with \(\textrm{MLPD}^{*}\) denoting the reference model MLPD). The right y-axis is simply the \(\exp (\cdot )\) scale, i.e., it shows \(\textrm{GMPD} / \textrm{GMPD}^{*}\) (with \(\textrm{GMPD}^{*}\) denoting the reference model GMPD). Each line represents one simulation iteration. The median \(\textrm{MLPD}^{*}\) across all \(R = 100\) simulation iterations is about \(-1.4\) (minimum: \(-1.7\), first quartile: \(-1.6\), third quartile: \(-1.1\), maximum: \(-0.6\)). The median \(\textrm{GMPD}^{*}\) across all \(R = 100\) simulation iterations is about 0.24 (minimum: 0.19, first quartile: 0.20, third quartile: 0.34, maximum: 0.54)

A reassuring conclusion from Fig. 1 is that for both projection methods, an increasing submodel size eventually causes the predictive performance of the submodels to approach that of the reference model, although there are simulation iterations where a certain discrepancy from the reference model performance persists even at large submodel sizes. Nevertheless, we can conclude that both projection methods pass a basic check for being implemented correctly.

Figure 1 also shows that in some simulation iterations, the augmented-data projection’s MLPDs at small to moderate submodel sizes are closer to the reference model MLPD than those from the latent projection. This is even more evident from Fig. 2 where \(\Delta \textrm{MLPD}_{\textrm{lat}} - \Delta \textrm{MLPD}_{\textrm{aug}} = \textrm{MLPD}_{\textrm{lat}} - \textrm{MLPD}_{\textrm{aug}}\) is illustrated. Figure 2 also reveals that there are a few simulation iterations where the latent projection leads to a better predictive performance at large submodel sizes. These simulation iterations are investigated in more detail in Appendix A (Online Resource 2).

An inspection of the MLPD (or rather GMPD) values on absolute scale in Appendix B (Online Resource 2) reveals that in extreme cases, the discrepancy in predictive performance between augmented-data and latent projection is indeed non-negligible.

Fig. 2 Predictive performance based on the latent projection minus predictive performance based on the augmented-data projection, for increasing submodel sizes and all \(R = 100\) simulation iterations (represented by lines). The right y-axis is simply the \(\exp (\cdot )\) scale of the left y-axis. Note that for a given simulation iteration, the predictor ranking can differ between the augmented-data and the latent projection

The lack of uncertainty bars in Figs. 1 and 2 obscures the fact that all underlying predictive performance values are only estimates. Thus, it is important to inspect, for example, the corresponding standard errors (SEs). Figure 3 does this by depicting the differences \(\textrm{SE}(\Delta \textrm{MLPD}_{\textrm{lat}}) - \textrm{SE}(\Delta \textrm{MLPD}_{\textrm{aug}})\). The mostly positive differences in Fig. 3 show that the latent projection is associated with greater uncertainty than the augmented-data projection. Analogously to the peaks at large submodel sizes from Fig. 2, there are latent-projection SEs at large submodel sizes which are noticeably smaller than their counterparts based on the augmented-data projection. As a side note, Appendix A (Online Resource 2) reveals that the SEs from one of the simulation iterations investigated there belong to this rare case.

Fig. 3 Standard error (SE) in relative predictive performance based on the latent projection minus the same SE based on the augmented-data projection, for increasing submodel sizes and all \(R = 100\) simulation iterations (represented by points and summarized by boxplots)

In the typical projpred workflow, the plot of the chosen performance statistic in dependence of the submodel size is mainly used in the decision for a submodel size for the final projection. Ideally, this plot-based decision is made manually by incorporating subject-matter knowledge, application-specific trade-offs, and the absolute scale of the predictive performance statistic. In a real-world application, the heuristic offered by projpred::suggest_size() should only be interpreted as a suggestion, but for the purpose of a simulation study, such a heuristic is helpful. Figure 4 illustrates the frequency (across the simulation iterations) of all encountered differences \(G_{\textrm{lat}} - G_{\textrm{aug}}\) of the sizes \(G_{\textrm{lat}}\) and \(G_{\textrm{aug}}\) suggested by this heuristic. The high peak of the distribution at zero shows that the augmented-data and the latent projection often result in the same suggestion for the submodel size. Moreover, the slight right-skewness of the distribution (i.e., the presence of a few large positive differences) indicates that there are some simulation iterations where the latent projection leads to a clearly larger suggested size than the augmented-data projection. This slower convergence of the submodel MLPDs towards the reference model MLPD in case of the latent projection was already visible more directly in Figs. 1 and 2. It is also reflected (indirectly) by the larger frequency of \(\texttt {NA}_{\textrm{lat}}\) compared to \(\texttt {NA}_{\textrm{aug}}\) in Fig. 4. A first glance at Figs. 1 and 2 might suggest that larger suggested sizes in case of the latent projection should be more frequent than they are, but uncertainty needs to be taken into account, too: the bigger SEs in case of the latent projection (Fig. 3) may cause the latent projection to arrive at similar suggested sizes as the augmented-data projection, even if the latent-projection submodel MLPDs approach the reference model MLPD more slowly.

Fig. 4 Suggested submodel size based on the latent projection (\(G_{\textrm{lat}}\)) minus suggested submodel size based on the augmented-data projection (\(G_{\textrm{aug}}\)). In cases where projpred::suggest_size() was not able to suggest a size (e.g., because the forward search was terminated before the submodel MLPD could approach the reference model MLPD sufficiently), NA was returned (\(\texttt {NA}_{\textrm{aug}}\): NA for the augmented-data projection only; \(\texttt {NA}_{\textrm{lat}}\): NA for the latent projection only; \(\texttt {NA}_{\textrm{both}}\): NA for both projections)

The slower convergence towards the reference model MLPD in case of the latent projection is also visible in a slight left-skewness (with a peak around zero) of the distribution of \(\textrm{MLPD}_{\textrm{lat}} - \textrm{MLPD}_{\textrm{aug}}\) at submodel size \(G_{\textrm{min}} = \min (G_{\textrm{aug}}, G_{\textrm{lat}})\) (provided that at least one of \(G_{\textrm{aug}}\) and \(G_{\textrm{lat}}\) is non-NA) across all simulation iterations (Fig. 5).

Fig. 5 Predictive performance based on the latent projection minus predictive performance based on the augmented-data projection, at size \(G_{\textrm{min}} = \min (G_{\textrm{aug}}, G_{\textrm{lat}})\). The total of 97 instead of \(R = 100\) simulation iterations is caused by 3 iterations where both projection methods resulted in a suggested size of NA. The top x-axis is simply the \(\exp (\cdot )\) scale of the bottom x-axis

Finally, Fig. 6 shows the runtime of the projpred::varsel() call for both projection methods. Clearly, the augmented-data projection takes much longer (median runtime across all simulation iterations: ca. 14.6 min) than the latent projection (median runtime across all simulation iterations: ca. 1.5 min). This is the price to pay for the exact projection instead of the approximate latent projection.

Fig. 6 Runtime (in minutes) of projpred::varsel() based on the augmented-data and the latent projection, for all \(R = 100\) simulation iterations (represented by points and summarized by boxplots)

4 Example: Renal cell carcinoma subtyping

We illustrate the application of the augmented-data projection embedded in a PPVS for a nominal response variable using a cancer dataset from the Institute of Pathology of the Rostock University Medical Center (Germany). This dataset is available in Online Resource 1 and consists of those 285 observations (patients) with complete records from the larger dataset used by Zimpfer et al. (2019).

Zimpfer et al. (2019) conducted a retrospective study for renal cell carcinoma (RCC) subtyping in accordance with the 2016 WHO classification. RCC subtyping is of prognostic relevance for patients, so it is crucial to determine the subtype accurately. In Zimpfer et al. (2019), RCC subtyping was performed histologically by trained pathologists. Our data contain three RCC subtypes: clear-cell RCC (relative frequency: ca. \(86.0\,\%\)), papillary RCC (ca. \(9.5\,\%\)), and a set of rare (WHO-unclassified) subtypes (ca. \(4.6\,\%\)).

Despite the focus on determining the RCC subtype accurately, it is also helpful to predict the RCC subtype as early as possible during patient care. Thus, we apply a PPVS with the three-level RCC subtype as response. On the predictor side, our reference model consists of the main effects and all possible two-way interactions of the following seven predictor variables, which were chosen based on Table 2 of Zimpfer et al. (2019):

  • age: age at diagnosis (in years),

  • sex: sex ("female" or "male"),

  • grade: histologic tumor grade (coded as "G1G2" for grades G1–G2 and "G3G4" for G3–G4),

  • stage: histologic tumor stage (coded as "T1T2" for stages T1–T2 and "T3T4" for T3–T4),

  • nodes: nodal metastases spread nearby (coded as "no" for N0 and "yes" for N1),

  • metastases: metastases 0–6 months post-diagnosis (coded as "no" for M0 and "yes" for M1),

  • resection: classification of the resection margin (coded as "R0" for R0 and "R1R2" for R1–R2).

This gives \(7 + \left( {\begin{array}{c}7\\ 2\end{array}}\right) = 28\) predictors per response category (except for the reference category), a situation that—given the 285 observations available here—does not really belong to the “small \(n\), large \(p\)” regime mentioned by Piironen et al. (2020). However, the “small \(n\), large \(p\)” regime is merely the regime the PPVS was originally applied to; in practice, the PPVS may just as well be applied to “small \(n\), small \(p\)” or other regimes.

In the following, we only describe modeling choices deviating from the defaults of the respective R function arguments. For details on the whole procedure, see the R code provided in Online Resource 1.

For fitting the reference model, we use the brms::categorical() response family from the R package brms. For the regression coefficients, we choose the R2-D2 prior (Zhang et al. 2022) as implemented in brms. In case of the brms::categorical() family, the R2-D2 prior’s \(R^2\) parameter does not have an intuitive interpretation (in contrast to normal linear models), but smaller \(R^2\) values still imply a stronger penalization. Here, we choose a mean of \(0.4\) and a pseudo-precision parameter of \(2.5\) for the Beta prior on \(R^2\), i.e., slightly more penalization than implied by the default uniform Beta prior. The current implementation of the R2-D2 prior in brms requires the predictors to be on a comparable scale (except if differing scales carry meaning with respect to the relevance of the predictors, in the sense that predictors with a larger scale should be more relevant, which we do not assume here). Thus, as suggested by Gelman et al. (2008), we scale the only continuous predictor variable age to a standard deviation of \(0.5\) (which corresponds to the standard deviation of a binary predictor with a relative frequency of \(50\,\%\) for both categories). Prior to scaling age, we center it to a mean of \(0\).
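A hedged sketch of this reference model fit; the dataset and variable names (rcc, subtype) as well as the dpar names (which depend on the response's factor levels) are hypothetical:

```r
library(brms)
## Center age, then scale it to a standard deviation of 0.5:
rcc$age <- 0.5 * as.vector(scale(rcc$age))

ref_fit <- brm(
  subtype ~ (age + sex + grade + stage + nodes + metastases +
               resection)^2,  # main effects and all two-way interactions
  data = rcc, family = categorical(),
  ## R2-D2 prior with mean 0.4 and pseudo-precision 2.5, set for each
  ## non-reference category's linear predictor (dpar names hypothetical):
  prior = c(
    set_prior(R2D2(mean_R2 = 0.4, prec_R2 = 2.5), class = "b",
              dpar = "mupapillary"),
    set_prior(R2D2(mean_R2 = 0.4, prec_R2 = 2.5), class = "b",
              dpar = "murare")
  ),
  backend = "cmdstanr"
)
```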

The Markov chains of the brms reference model fit appear to have converged: all checks that we already performed in the simulation study (Sect. 3.1) are passed. Furthermore, we conduct some basic checks for the reference model to be appropriate from a predictive point of view. These checks (not shown here, but reported in Online Resource 1) reveal that the reference model’s predictions are largely driven by the intercepts. (In a brms::categorical() model, the intercepts transformed to response scale—i.e., to probabilities—reflect the hypothetical frequencies of the response categories at predictor values of zero.) In this sense, the reference model (or rather the data it is based upon) is suboptimal, but still sufficient for illustrative purposes.

Within projpred, we perform the PPVS using a \(K\)-fold cross-validation (\(K\)-fold CV), here with \(K = 30\). Based on a preliminary projpred::cv_varsel() run with Pareto-smoothed importance sampling leave-one-out CV (PSIS-LOO CV; Vehtari et al. 2017, 2022) and a full-data search (i.e., a search that was not run separately for each CV fold), we restrict the maximum submodel size for the fold-wise searches in the final projpred::cv_varsel() run (the \(K\)-fold one) to \(3\), thereby saving computational resources.
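A hedged sketch of this final variable selection run (argument names as in projpred 2.4.0; ref_fit is the hypothetical reference model fit from above):

```r
library(projpred)
## Final run: 30-fold CV, forward search, and the maximum submodel size
## reduced to 3 based on the preliminary PSIS-LOO-CV run:
cvvs <- cv_varsel(ref_fit, cv_method = "kfold", K = 30,
                  method = "forward", nterms_max = 3)
plot(cvvs, stats = "mlpd", deltas = TRUE)  # cf. Fig. 7
```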

The whole projpred part of our code takes approximately 15 min on a standard desktop machine. The final projpred::cv_varsel() run yields the predictive performance plot depicted in Fig. 7.

Fig. 7 Relative predictive performance for increasing submodel sizes in the RCC example from Sect. 4. Here, “relative” means that the left y-axis shows \(\Delta \textrm{MLPD} = \textrm{MLPD} - \textrm{MLPD}^{*}\) (with \(\textrm{MLPD}^{*}\) denoting the reference model MLPD). The right y-axis is simply the \(\exp (\cdot )\) scale, i.e., it shows \(\textrm{GMPD} / \textrm{GMPD}^{*}\) (with \(\textrm{GMPD}^{*}\) denoting the reference model GMPD). The vertical uncertainty bars indicate \(\textrm{SE}(\Delta \textrm{MLPD})\) (one such SE to either side of the point estimate)

Based on Fig. 7, we choose a submodel size of \(2\). The heuristic implemented in projpred::suggest_size() would have given a size of \(1\) (because size \(1\) is the smallest size where the submodel MLPD point estimate is less than one standard error below the reference model MLPD point estimate). Here, we choose the slightly bigger size of \(2\) due to the medical context, where the primary goal is predictive accuracy, with sparsity being only a secondary goal.

The summary of the fold-wise predictor rankings presented in Table 1 shows that all \(K = 30\) CV folds agree on the first two predictors: metastases and nodes (in this order). Thus, our selected submodel consists of these two predictors. After a final projection of the reference model onto this submodel (this time using the draw-by-draw method, i.e., projecting each posterior draw from the reference model onto the submodel parameter space without any clustering), we can make predictions with this submodel. These predictions are presented in Table 2 (this compact form is possible here because there are only two binary predictors). Although the absolute changes in the predictive probabilities might at first seem quite large (up to about \(22\,\%\) when changing only one predictor at a time, and up to about \(27\,\%\) when changing both predictors simultaneously), the predictive probabilities are still dominated by the empirical frequencies of the response categories in the data and thus by the intercepts. As mentioned above, this was already observable in the reference model, so it is no surprise that the pattern reappears here: model selection cannot be expected to yield a model with better predictions than the reference model (Vehtari and Ojanen 2012; Piironen and Vehtari 2017a), especially in the context of projections which are essentially fitting to the fit of the reference model.
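A hedged sketch of this final projection and the predictions underlying Table 2 (argument names as in projpred 2.4.0, where the predictor terms are passed via solution_terms; depending on the projpred version, newdata may additionally need to contain a response column):

```r
## Draw-by-draw projection onto the selected submodel (ndraws set to the
## total number of posterior draws, so no clustering is involved):
prj <- project(ref_fit, solution_terms = c("metastases", "nodes"),
               ndraws = 4000)
## All four predictor combinations summarized in Table 2:
newdat <- expand.grid(metastases = c("no", "yes"), nodes = c("no", "yes"))
## Response-scale (probability) predictions:
prj_pred <- proj_linpred(prj, newdata = newdat, transform = TRUE)
```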

Table 1 Summary of the fold-wise predictor rankings
Table 2 All possible predictions from the final submodel

5 Discussion

We have presented how the projective part of the PPVS can be performed in case of a discrete response family with finite support. This augmented-data projection has been implemented as an extension of the projpred R package.

Apart from the presentation of the methodology, the purpose of this paper was to compare the augmented-data projection to the latent projection, an alternative projection method that is far more general than the augmented-data projection and covers many discrete finite-support response families as well. The simulation study we have conducted to this end demonstrated that most of the time, the two projection methods behave quite similarly in terms of predictive performance and the submodel size found by the projpred::suggest_size() heuristic. In some cases, the augmented-data projection yields a better predictive performance and (although not necessarily in the same cases) a smaller suggested size than the latent projection. In even less frequent cases, it is the latent projection which yields a better predictive performance and a smaller suggested size.

Overall (i.e., across all simulation iterations), the predictive performance of the submodels and the variable selection based upon it seem to be more stable in case of the augmented-data projection. This is probably due to the exact nature of the augmented-data projection, as opposed to the approximate nature of the latent projection. For example, in case of the ordinal family used here, one reason for the worse stability of the latent projection could be that it uses the reference model’s draws of the threshold parameters to compute response-scale output (such as the response-scale MLPD) for a submodel: in general, the smaller the submodel size, the larger the lack of fit between the latent predictor of a submodel and the latent predictor of the reference model will be. When using an ad-hoc solution for computing (response-scale) predictive probabilities by relying on the reference model’s thresholds, a lack of fit in the latent predictor causes the predictive probabilities of a submodel to become suboptimal without the projection noticing this (and thus without the possibility for the projection to adjust the regression coefficients). In contrast, the augmented-data projection aims at directly reproducing the predictive probabilities of the reference model, adjusting both the regression coefficients and the thresholds of a submodel. In principle, the latent projection also allows the predictive performance statistic(s) and other post-projection quantities to be calculated on the latent scale. By converting the results from the augmented-data projection to the latent scale as well, we could have tried to compare the augmented-data and the latent projection on the latent scale. However, in settings like ours where there is an independent test dataset (and the same applies to \(K\)-fold CV), it is not straightforward to define how the latent-scale predictions for the test dataset should be calculated (using the reference model fit based on the training data would induce a dependency between training and test data). Furthermore, latent-scale performance statistics like the latent-scale MLPD are not easily interpretable. Hence, we did not perform latent-scale analyses in our simulation study.

The MLPD was the only predictive performance statistic in our simulation study. In principle, the classification accuracy could be used as an alternative performance statistic in discrete finite-support observation models. However, especially in case of a moderate to large number of response categories (like the \(J = 5\) categories in our simulation study), this comes with a loss of information that the MLPD does not exhibit: for example, if the true response category of an observation is category \(3\) (out of \(5\)) and a model gives a predictive probability of \(23\,\%\) for category \(3\), a predictive probability of \(24\,\%\) for category \(4\), and predictive probabilities smaller than \(23\,\%\) for all other categories, then the prediction of the highest-probability category would lead to a misclassification in the zero–one utility spirit of the classification accuracy. The MLPD is smoother in the sense that the log predictive probability of that observation is \(\log (0.23)\), which would not differ much from the log predictive probability of \(\log (0.24)\) in a situation where the predictive probabilities for categories \(3\) and \(4\) were reversed. In any case, even if the accuracy may be considered appropriate in some use cases (after all, the choice of performance statistic is application-specific), we do not expect our main conclusions to change substantially with alternative performance statistics.
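To put the toy example above in numbers (a trivially small log-score difference versus a full flip of the zero-one score):

```r
## Swapping the predictive probabilities of categories 3 and 4 changes
## the observation's log predictive probability only slightly ...
log(0.24) - log(0.23)  # ~ 0.043
## ... whereas the classification accuracy for this observation flips
## from 0 (misclassified) to 1 (correct).
```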

The cost of the augmented-data projection’s higher stability is a considerable increase in runtime. Because of this, it might be helpful to use the latent projection for preliminary results in the model-building workflow and to use the augmented-data projection afterwards for final results. One particular purpose of a preliminary latent-projection run could be to find a reasonable value for argument nterms_max of projpred::varsel() or projpred::cv_varsel() (this argument determines up to which submodel size the search should be conducted) because often, nterms_max can be chosen smaller than the value implied by the default heuristic, which reduces the runtime for the final augmented-data projection significantly.
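A hedged sketch of this two-stage workflow (whether latent = TRUE is passed via get_refmodel() or directly to the search function may depend on the projpred version; the object names and the chosen nterms_max value are hypothetical):

```r
library(projpred)
## Stage 1: fast latent-projection run for orientation.
refm_lat <- get_refmodel(ref_fit, latent = TRUE)
vs_lat <- varsel(refm_lat, method = "forward")
plot(vs_lat, stats = "mlpd", deltas = TRUE)  # pick a plausible size
## Stage 2: final augmented-data run with a reduced nterms_max.
vs_aug <- varsel(ref_fit, method = "forward", nterms_max = 5)
```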

The only number of response categories investigated in our simulation study was \(J = 5\), which may be regarded as overly restrictive. In Appendix C (Online Resource 2), we show how our results would have changed in case of \(J = 3\) and \(J = 7\). Most importantly, our main conclusion (the recommendation of using the latent projection for preliminary results and the augmented-data projection for final results) would not have changed.

An advantage of the augmented-data projection that was briefly mentioned in Sect. 2.3 and later illustrated in the example from Sect. 4 is the support for nominal families (like brms::categorical()) which rely on more than one latent predictor per observation (typically, nominal families come with one observation-wise latent predictor for each response category except for a reference category). So far, such families are not supported by the latent projection.

It is worth mentioning that just like the traditional PPVS, the PPVS based on the augmented-data projection is—by construction—sensitive to the choice of reference model. For the traditional PPVS, this is demonstrated, e.g., by McLatchie et al. (2023). For our cancer subtyping example from Sect. 4, this is demonstrated in Appendix D (Online Resource 2).

In the future (and if requested by users), the implementation of the augmented-data projection in projpred can be extended to more exotic discrete finite-support response families in a straightforward manner (see Sect. 2.3).

Furthermore, the augmented-data projection might also be applicable to continuous response families and discrete families with infinite support, using either a Monte Carlo or a discretization approach for achieving an artificial support that is discrete and finite. The Monte Carlo approach might require a clustering or some other kind of grouping of the response draws to arrive at a practicable number of response categories. For the discretization approach, it might be possible to borrow ideas from Röver and Friede (2017).

Finally, we note that the augmented-data projection in projpred also supports multilevel models. Since the PPVS for multilevel models (in general) is currently subject to more detailed investigations, we leave the comparison of augmented-data and latent projection for multilevel models for future research.

6 Supplementary information

The following Online Resources are available for this article:

Online Resource 1: Example files. The files (dataset, code, and additional output) for the example from Sect. 4.

Online Resource 2: Appendices. A single document containing multiple appendices with further simulation results and a sensitivity analysis for the reference model from Sect. 4.