Projection predictive variable selection for discrete response families with finite support

The projection predictive variable selection is a decision-theoretically justified Bayesian variable selection approach achieving an outstanding trade-off between predictive performance and sparsity. Its projection problem is not easy to solve in general because it is based on the Kullback-Leibler divergence from a restricted posterior predictive distribution of the so-called reference model to the parameter-conditional predictive distribution of a candidate model. Previous work showed how this projection problem can be solved for response families employed in generalized linear models and how an approximate latent-space approach can be used for many other response families. Here, we present an exact projection method for all response families with discrete and finite support, called the augmented-data projection. A simulation study for an ordinal response family shows that the proposed method performs better than or similarly to the previously proposed approximate latent-space projection. The cost of the slightly better performance of the augmented-data projection is a substantial increase in runtime. Thus, in such cases, we recommend the latent projection in the early phase of a model-building workflow and the augmented-data projection for final results. The ordinal response family from our simulation study is supported by both projection methods, but we also include a real-world cancer subtyping example with a nominal response family, a case that is not supported by the latent projection.


Introduction
The projection predictive variable selection (Piironen et al., 2020; Catalina et al., 2022) is a special predictive model selection method (Vehtari and Ojanen, 2012) for Bayesian regression models that comes with valid post-selection inference (disregarding the selection of the final model size) and has been shown to perform better, in general, than alternative methods (Piironen and Vehtari, 2017a). It is based on the Bayesian decision-theoretical variable selection framework by Lindley (1968) and the practical draw-by-draw Kullback-Leibler (KL) projection proposed by Goutis and Robert (1998) and Dupuis and Robert (2003). So far, the implementation of the projection predictive variable selection in the R (R Core Team, 2023) package projpred (Piironen et al., 2023) has been restricted to the Gaussian, the binomial, and the Poisson response families. Recently, the latent projection (Catalina et al., 2021) has extended the range of possible response families considerably, for example, to the ordinal family underlying MASS::polr() (Venables and Ripley, 2002). However, the latent projection is an approximate approach as it replaces the original projection problem with a latent projection problem. Here (section 2), we present the exact solution to the original projection problem for discrete finite-support response families and call the corresponding procedure the augmented-data projection.
To investigate the performance of the augmented-data projection (section 3), we confine ourselves to a simulation study comparing the augmented-data projection to the latent projection, because the generally superior performance of the projection predictive variable selection based on the traditional projection and on the latent projection has already been demonstrated by Piironen and Vehtari (2017a) and Catalina et al. (2021), respectively.
We illustrate the application of the augmented-data projection in section 4 with the help of a real-world example, thereby also demonstrating another benefit of the augmented-data projection, namely the support for some response families that are not supported by the latent projection.
Finally, our work is discussed in section 5, where we also mention possible modifications of the augmented-data projection to extend it to more response families in the future.

Notation
For the following mathematical presentation of the augmented-data projection (a special case of the general approach that is presented first), we assume the availability of a dataset with N observations. The observed response vector will be denoted by y = (y_1, ..., y_N)^T ∈ Y^N ⊆ R^N. We do not introduce any notation for the corresponding predictor data as we will always be conditioning implicitly on it. By ỹ = (ỹ_1, ..., ỹ_N)^T, we will denote unobserved response values at the same observed predictor values, with realizations in Y^N.
A crucial part (Piironen et al., 2020; Pavone et al., 2022) of the superior performance of the projection predictive variable selection is the reference model, which is the best possible model (in terms of predictive performance) one can construct. For projpred, the reference model is usually fitted within rstanarm (Goodrich et al., 2023) or brms (Bürkner, 2017, 2018), which both rely on Stan (Carpenter et al., 2017; Stan Development Team, 2022b), a probabilistic programming language and software that is mainly used for its dynamic Hamiltonian Monte Carlo (HMC) algorithm, a modern Markov chain Monte Carlo (MCMC) sampler. However, the methodology behind projpred is more general and does not require the reference model to be fitted within rstanarm or brms. Thus, we start by assuming to have S* draws θ*_s ∈ Θ* (s ∈ {1, ..., S*}) from the reference model's posterior distribution, with Θ* denoting the reference model's parameter space. Furthermore, we assume that these S* posterior draws have been clustered or thinned so that {1, ..., S*} ⊇ ⋃_{c=1}^{C} I*_c with disjoint index sets I*_c. An explanation of how the clustering is performed in projpred will be given below. Based on the clustering (or thinning), we can define the reference model's c-restricted posterior predictive distribution (for observation i) as the empirical average

p(ỹ_i | I*_c) = (1 / |I*_c|) Σ_{s ∈ I*_c} p(ỹ_i | θ*_s).

In doing so, the conditioning on an index set is slightly abusing notation, but we think it improves readability while at the same time reflecting the basic idea behind this empirical average. Expectations with respect to p(ỹ_i | I*_c) will be denoted by E(• | I*_c). A model selection problem comes with several candidate models, of which we will consider only a single one here, to avoid cluttering notation. In the context of a variable selection problem, this candidate model may also be called a submodel of the full model which includes all predictors. The parameter space of this representative submodel will be denoted by Θ and its parameter-conditional predictive distribution (i.e., its likelihood when regarded as a function of the parameters) by p(ỹ_i | θ) (for θ ∈ Θ). We emphasize that in general, Θ does not have to be related to Θ* in any form (in particular, it does not have to be a restricted subspace).
Finally, we need the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951) from a distribution p(x) to a distribution q(x):

D_KL(p(x) ∥ q(x)) = E_{p(x)}(log(p(x) / q(x))),

where we have added the subscript p(x) to clarify the distribution that the expectation refers to.
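For two distributions on a common finite support, this divergence reduces to a finite sum. The following minimal sketch (a generic helper for illustration, not part of projpred) makes that concrete:

```python
import numpy as np

def kl_divergence(p, q):
    """KL divergence from p to q for two probability vectors over the
    same finite support, using the convention 0 * log(0) = 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))

print(kl_divergence([1.0, 0.0], [0.5, 0.5]))  # log(2) ≈ 0.693
```

Note the asymmetry of the arguments: the expectation is taken with respect to the first distribution, matching the subscript convention above.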
For the clustering (and several other steps), projpred requires an invertible link function g. With this link function g, projpred performs the clustering of the S* posterior draws by applying stats::kmeans() (the stats package is part of R) to the S* length-N vectors g(E(ỹ | θ*_s)), where g here denotes the vectorized link function, i.e., the function which applies g to each element of a vector.
This projection problem is not easy to solve in general because the expectation E(log p(ỹ_i | θ) | I*_c) typically has no closed form. A simplifying case is an exponential-family likelihood in the sense of McCullagh and Nelder (1989, equation (2.4)) because in that case, log p(ỹ_i | θ) is linear in ỹ_i, at least for optimization with respect to the non-dispersion parameters. Another simplifying case is |Y| < ∞, which is the gist here (see section 2.3).

Discrete finite-support response families
For a response family with |Y| < ∞, the expectation E(log p(ỹ_i | θ) | I*_c) is then a sum over all possible response values:

E(log p(ỹ_i | θ) | I*_c) = Σ_{ỹ ∈ Y} p(ỹ_i = ỹ | I*_c) log p(ỹ_i = ỹ | θ).     (2)

Maximizing equation (2) with respect to θ is simply a weighted maximum-likelihood (ML) problem when using an augmented dataset where each observation is repeated |Y| times and the response value is set to each possible value ỹ ∈ Y in turn, with the reference model's probabilities p(ỹ_i = ỹ | I*_c) serving as observation weights, so that the resulting augmented dataset has a total of N · |Y| rows. This approach is what we call the augmented-data projection, although for implementation in projpred, the augmented dataset is constructed internally to have |Y| blocks of N rows instead of the other way round. Equation (2) shows that the augmented-data projection consists of fitting to the fit of the reference model, a fundamental property already exhibited by the traditional projection (Piironen et al., 2020). In case of a discrete response family with finite support, the fit of the reference model just needs to be expressed differently, namely in terms of probabilities for all of the response categories, and fitting to that fit then needs to be done in a weighted fashion.
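To make the augmented-data construction concrete, the following sketch builds the N · |Y| augmented rows (in |Y| blocks of N rows, as in projpred) and solves the weighted ML problem for the simplest possible submodel, an intercept-only categorical model, where the weighted ML solution has a closed form. The reference-model probabilities here are simulated stand-ins; this is not projpred's internal code:

```python
import numpy as np

rng = np.random.default_rng(0)
N, J = 100, 5  # observations, response categories

# stand-in reference-model predictive probabilities (each row sums to 1)
p_ref = rng.dirichlet(np.ones(J), size=N)

# augmented dataset: J blocks of N rows, with the reference predictive
# probabilities as observation weights
obs_idx = np.tile(np.arange(N), J)   # which original observation
y_aug = np.repeat(np.arange(J), N)   # which response category
w_aug = p_ref[obs_idx, y_aug]        # projection weights

# for an intercept-only categorical submodel, the weighted-ML solution
# is simply the weighted category frequencies ...
counts = np.bincount(y_aug, weights=w_aug, minlength=J)
p_hat = counts / counts.sum()
# ... which reproduce the reference model's average predictive probabilities
assert np.allclose(p_hat, p_ref.mean(axis=0))
```

For submodels with predictors, the same weighted dataset would instead be passed to a weighted ML routine such as (in R) MASS::polr(); the closed form above only exists in this degenerate intercept-only case.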
Due to the augmented-data projection being a weighted ML problem, the basic idea for implementing it in projpred is simply to apply existing R functions capable of performing a weighted ML estimation (e.g., MASS::polr() in case of the commonly used cumulative ordinal models) to the augmented dataset. Currently, projpred's augmented-data projection adds support for the brms::cumulative() family, for rstanarm::stan_polr() fits, and for the brms::categorical() family. (These families are additional in comparison to projpred's traditional projection; projpred's latent projection already supports these families, except for the brms::categorical() family.) We emphasize that these families refer to the submodels, not the reference model. Typically, the reference model has the same response family as the submodels, but in general, the reference model is allowed to have a different family. In case of the augmented-data projection, the only requirement concerning the form of the reference model is that its response family is discrete and has finite support (otherwise, the step from equation (1) to equation (2) would be incorrect). (In theory, equation (2) does not require the submodel to have a discrete finite-support response family, but typically, and especially with respect to the implementation in projpred, this requirement makes sense.) The augmented-data projection has been added in version 2.4.0 of projpred (Piironen et al., 2023). In that version, an updated implementation of the latent projection (compared to Catalina et al., 2021) has been included as well. Note that for applying both the augmented-data projection and the updated implementation of the latent projection to reference model fits from brms, version 2.19.0 (or later) of brms is needed.

Simulation study
For the following simulation study comparing augmented-data and latent projection, we assume that the reader is familiar with the typical projpred workflow, as presented in the main vignette of the projpred package, for example.

Setup
Since the latent projection does not support the brms::categorical() family, our simulation study is restricted to the brms::cumulative() family (which encodes the same observation model as rstanarm::stan_polr() fits). More specifically, to comply with Catalina et al. (2021), we use J = |Y| = 5 response categories and the probit link function g = Φ⁻¹ (the quantile function of the standard normal distribution). The number of observations is set to N = 100, in accordance with the value used throughout the main article of Catalina et al. (2021).
Then, for each of R = 100 simulation iterations, the simulation study involves the following steps:

1. Define the J − 1 latent thresholds (intercepts) ζ_j (j ∈ {1, ..., J − 1}).

2. Generate P = 50 regression coefficients β_p (p ∈ {1, ..., P}) according to a regularized horseshoe prior (Piironen and Vehtari, 2017c). The underlying mechanism may be found in the R code for this simulation study (see the link at the end of this section). The global scale parameter is chosen based on pseudo variances σ̃²_j calculated according to Section 3.5 of Piironen and Vehtari (2017c), taking the same thresholds ζ_j as defined above and assuming a typical data point with a latent predictor of zero so that all response categories are equally likely (in analogy to the approach of Piironen and Vehtari, 2017b, in case of the binomial family with the logit link). Here, we obtain an overall pseudo variance of σ̃² ≈ 1.06². For the Student-t slab of the regularized horseshoe prior, we choose 100 degrees of freedom (effectively yielding a Gaussian slab) and a scale parameter of 1.
3. Generate a training dataset according to the following data-generating model, where i ∈ {1, ..., N}: x_{i,p} ∼ N(0, 1) (p ∈ {1, ..., P}) and y_i ∼ Cumul(ζ, η_i) with latent predictor η_i = Σ_{p=1}^{P} β_p x_{i,p}. Here, N(µ, σ) denotes a normal distribution with mean µ and standard deviation σ, and Cumul(ζ, η_i) denotes the cumulative model distribution with probability mass function Pr(y_i = j | ζ, η_i) = g⁻¹(ζ_j − η_i) − g⁻¹(ζ_{j−1} − η_i) (with ζ_0 = −∞ and ζ_J = +∞).

4. Generate an independent test dataset using the same data-generating model and the same settings (in particular, the same number N of observations) as for the training data.
5. Fit a reference model to the training data, using the data-generating model as the data-fitting model, except that the prior for the thresholds ζ_j (j ∈ {1, ..., J − 1}) is set to N(0, 2.5) in the data-fitting model. The reference model fit is performed by brms::brm(), using the cmdstanr (Gabry and Češnovar, 2022) backend. We use the default of 4 Markov chains, each running 1000 warmup and 1000 post-warmup iterations. In order to avoid spurious divergences of Stan's dynamic HMC sampler, we aim at smaller step sizes by setting adapt_delta = 0.99. By specifying init = 1, we narrow down the range that the initial parameter values are randomly drawn from (this was necessary to avoid that occasionally, some chains would initialize in an area of the parameter space with log posterior density numerically equal to −∞ or, shortly after initialization, would run into such an area). We checked the convergence of the Markov chains for an initial reference model fit (based on a dataset independent of those from the R = 100 simulation iterations) with the help of common MCMC diagnostics (Betancourt, 2018; Vehtari et al., 2021; Stan Development Team, 2022a; Bürkner et al., 2023).
6. Run projpred. More specifically, the following steps are performed twice (once with the augmented-data projection and once with the latent projection, but based on the same training and test data and on the same reference model fit):

(a) Run projpred::varsel(), specifying the test data via argument d_test. As search method, we choose the forward search because projpred's augmented-data projection currently does not support the L1 search and also because the L1 search is often less accurate. Apart from that, we leave all other arguments at their defaults.

(b) For each submodel size along the solution path: Retrieve the mean log predictive density (MLPD; actually the mean log predictive probability here, but the same acronym is used for simplicity), ∆MLPD = MLPD − MLPD* (with MLPD* denoting the reference model MLPD), and the corresponding standard errors (SEs). This is achieved via projpred:::summary.vsel(), once with deltas = FALSE (for the MLPD) and once with deltas = TRUE (for ∆MLPD). Here, the MLPD is the chosen performance statistic because of the desirable properties of the log score in general (Vehtari and Ojanen, 2012) and because exp(MLPD), the geometric mean predictive density (GMPD), has an interpretable scale of (0, 1) in case of a discrete response family. We denote the MLPD based on the augmented-data projection by MLPD_aug and the corresponding ∆MLPD value by ∆MLPD_aug. For the latent projection, these are denoted by MLPD_lat and ∆MLPD_lat, respectively.

(c) Suggest a submodel size via projpred::suggest_size(). As underlying performance statistic, we choose the MLPD again, for consistency with the results retrieved from projpred:::summary.vsel(). We denote the suggested size based on the augmented-data projection by G_aug and the suggested size based on the latent projection by G_lat.
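The relationship between the MLPD and the GMPD used in step (b) can be sketched in a few lines; the predictive probabilities below are made-up illustrative values:

```python
import math

# hypothetical predictive probabilities of the observed test responses
probs = [0.23, 0.24, 0.50, 0.10]

# mean log predictive density (here: mean log predictive probability)
mlpd = sum(math.log(p) for p in probs) / len(probs)

# exp(MLPD) is the geometric mean of the predictive probabilities,
# which always lies in (0, 1) for a discrete response family
gmpd = math.exp(mlpd)
assert abs(gmpd - (0.23 * 0.24 * 0.50 * 0.10) ** 0.25) < 1e-12
```

This is why the GMPD has an interpretable scale: it is the "typical" predictive probability assigned to an observed response.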
The R code for this simulation study is available on GitHub. Figures were created with ggplot2 (Wickham, 2016).
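The data-generating mechanism from steps 3 and 4 can be sketched as follows; the threshold and coefficient values here are arbitrary stand-ins, not the ones used in the simulation study:

```python
import numpy as np
from math import erf, sqrt

def Phi(x):
    """Standard normal CDF (the inverse of the probit link)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def rcumul_probit(eta, zeta, rng):
    """Draw one ordinal response from a cumulative probit model,
    where P(y <= j) = Phi(zeta_j - eta)."""
    cdf = np.array([Phi(z - eta) for z in zeta] + [1.0])
    return int(np.searchsorted(cdf, rng.uniform())) + 1  # categories 1..J

rng = np.random.default_rng(1)
N, P, J = 100, 50, 5
zeta = [-1.5, -0.5, 0.5, 1.5]        # J - 1 hypothetical thresholds
beta = rng.normal(0.0, 0.2, size=P)  # hypothetical sparse-ish coefficients
X = rng.normal(0.0, 1.0, size=(N, P))  # x_{i,p} ~ N(0, 1)
y = np.array([rcumul_probit(X[i] @ beta, zeta, rng) for i in range(N)])
```

An independent test dataset (step 4) would simply repeat the last two lines with fresh draws of X and y while keeping zeta and beta fixed.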

Results
A central part of the projpred workflow is the plot of the chosen performance statistic (relative to the reference model's performance) as a function of the submodel size. Basically, this is also what is shown in Figure 1, but slightly adapted to a simulation study: The lines from all R = 100 simulation iterations are combined into one plot for the augmented-data and the latent projection, respectively. To avoid an overly crowded plot, the uncertainty bars that are otherwise part of this plot have been omitted.
A reassuring conclusion from Figure 1 is that for both projection methods, an increasing submodel size eventually causes the predictive performance of the submodels to approach that of the reference model, although there are simulation iterations where a certain discrepancy to the reference model performance persists even at large submodel sizes. Nevertheless, we can conclude that both projection methods pass a basic check for being implemented correctly.
Figure 1 also shows that in some simulation iterations, the augmented-data projection's MLPDs at small to moderate submodel sizes are closer to the reference model MLPD than those from the latent projection. This is even more evident from Figure 2, where ∆MLPD_lat − ∆MLPD_aug = MLPD_lat − MLPD_aug is illustrated.
Figure 2 also reveals that there are a few simulation iterations where the latent projection leads to a better predictive performance at large submodel sizes.These simulation iterations are investigated in more detail in Appendix A.
An inspection of the MLPD (or rather GMPD) values on absolute scale (Appendix B) reveals that in extreme cases, the discrepancy in predictive performance between augmented-data and latent projection is indeed non-negligible.
The lack of uncertainty bars in Figures 1 and 2 obscures the fact that all underlying predictive performance values are only estimates. Thus, it is important to inspect, for example, the corresponding standard errors (SEs). This is achieved by Figure 3, which depicts the differences SE(∆MLPD_lat) − SE(∆MLPD_aug). The mostly positive differences in Figure 3 show that the latent projection is associated with greater uncertainty than the augmented-data projection. Analogously to the peaks at large submodel sizes from Figure 2, there are some latent-projection SEs at large submodel sizes which are noticeably smaller than their counterparts based on the augmented-data projection. As a side effect, Appendix A reveals that the SEs from one of the simulation iterations investigated there are part of this rare case.
In the typical projpred workflow, the plot of the chosen performance statistic as a function of the submodel size is mainly used in the decision for a submodel size for the final projection. Ideally, this plot-based decision is made manually by incorporating subject-matter knowledge, application-specific trade-offs, and the absolute scale of the predictive performance statistic. In a real-world application, the heuristic offered by projpred::suggest_size() should only be interpreted as a suggestion, but for the purpose of a simulation study, such a heuristic is helpful. Figure 4 illustrates the frequency (across the simulation iterations) of all encountered differences G_lat − G_aug of the sizes G_lat and G_aug suggested by this heuristic. The high peak of the distribution at zero shows that the augmented-data and the latent projection often result in the same suggestion for the submodel size. Moreover, the slight right-skewness of the distribution (i.e., the presence of a few large positive differences) indicates that there are some simulation iterations where the latent projection leads to a clearly larger suggested size than the augmented-data projection. This slower convergence of the submodel MLPDs towards the reference model MLPD in case of the latent projection was already visible more directly in Figures 1 and 2. It is also reflected (indirectly) by the larger frequency of NA_lat compared to NA_aug in Figure 4. A first glance at Figures 1 and 2 might suggest that larger suggested sizes in case of the latent projection should be more frequent than they are, but uncertainty needs to be taken into account, too: The bigger SEs in case of the latent projection (Figure 3) may cause the latent projection to arrive at similar suggested sizes as the augmented-data projection, even if the latent-projection submodel MLPDs approach the reference model MLPD more slowly.
The slower convergence towards the reference model MLPD in case of the latent projection is also visible in a slight left-skewness (with peak around zero) of the distribution of MLPD_lat − MLPD_aug at submodel size G_min = min(G_aug, G_lat) (provided at least one of G_aug and G_lat is non-NA) across all simulation iterations (Figure 5). Finally, Figure 6 shows the runtime of the projpred::varsel() call for both projection methods. Clearly, the augmented-data projection takes much longer (median runtime across all simulation iterations: ca. 14.6 minutes) than the latent projection (median runtime across all simulation iterations: ca. 1.5 minutes). This is the price to pay for the exact projection instead of the approximate latent projection.

Example: Renal cell carcinoma subtyping
We illustrate the application of the augmented-data projection embedded in a projection predictive variable selection for a nominal response variable using a cancer dataset provided by an Institute of Pathology; the renal cell carcinoma (RCC) subtype of each case was determined by pathologists. Our data contains three RCC subtypes: clear-cell RCC (relative frequency: ca. 86.0 %), papillary RCC (ca. 9.5 %), and a set of rare (WHO-unclassified) subtypes (ca. 4.6 %).
Despite the focus on determining the RCC subtype accurately, it is also helpful to predict the RCC subtype as early as possible during the process of patient care. Thus, we apply a projection predictive variable selection with the three-level RCC subtype as response. On the side of the predictors, our reference model consists of the main effects and all possible two-way interactions of the following seven predictor variables:

• age: age at diagnosis (in years),
• sex: sex ("female" or "male"),
• grade: histologic tumor grade (coded as "G1G2" for grades G1-G2 and "G3G4" for G3-G4),
• stage: histologic tumor stage (coded as "T1T2" for stages T1-T2 and "T3T4" for T3-T4),
• nodes: nodal metastases spread nearby (coded as "no" for N0 and "yes" for N1),
• metastases: metastases 0-6 months post-diagnosis (coded as "no" for M0 and "yes" for M1),
• resection: classification of the resection margin (coded as "R0" for R0 and "R1R2" for R1-R2).
In the following, we only describe modeling choices deviating from the defaults of the respective R function arguments.
For fitting the reference model, we use the brms::categorical() response family from the R package brms.
For the regression coefficients, we choose the R2-D2 prior (Zhang et al., 2022) as implemented in brms.
In case of the brms::categorical() family, the R2-D2 prior's R 2 parameter does not have an intuitive interpretation (in contrast to normal linear models), but smaller R 2 values still imply a stronger penalization.
Here, we choose a mean of 0.4 and a pseudo-precision parameter of 2.5 for the Beta prior on R², i.e., slightly more penalization than implied by the default uniform Beta prior. The current implementation of the R2-D2 prior in brms requires a comparable scale of the predictors (except if differing scales have a meaning with respect to the relevance of predictors, in the sense that predictors with a larger scale should be more relevant, which we do not assume here). Thus, as suggested by Gelman et al. (2008), we scale the only continuous predictor variable age to a standard deviation of 0.5 (which corresponds to the standard deviation of a binary predictor with a relative frequency of 50 % for both categories). Prior to scaling age, we center it to a mean of 0.
The convergence of the Markov chains in the brms reference model fit seems to be given: All checks that we already performed in the simulation study (section 3.1) are passed. Furthermore, we conduct some basic checks for the reference model to be appropriate from a predictive point of view. These checks (not shown here) reveal that the reference model's predictions are largely driven by the intercepts. (In a brms::categorical() model, the intercepts transformed to response scale, i.e., to probabilities, reflect the hypothetical frequencies of the response categories at predictor values of zero.) In this sense, the reference model (or rather the data it is based upon) is suboptimal, but still sufficient for illustrative purposes.
Within projpred, we perform the projection predictive variable selection using a K-fold cross-validation (K-fold CV), here with K = 30. Based on a preliminary projpred::cv_varsel() run with Pareto-smoothed importance sampling leave-one-out CV (PSIS-LOO CV; Vehtari et al., 2017, 2022) and a full-data search (i.e., a search that was not run separately for each CV fold), we restrict the maximum submodel size for the fold-wise searches in the final projpred::cv_varsel() run (the K-fold one) to 3, thereby saving computational resources.
The whole projpred part of our code takes approximately 15 minutes on a standard desktop machine. The final projpred::cv_varsel() run yields the predictive performance plot depicted in Figure 7.
Based on Figure 7, we choose a submodel size of 2. The heuristic implemented in projpred::suggest_size() would have given a size of 1 (because size 1 is the smallest size where the submodel MLPD point estimate is less than one standard error smaller than the reference model MLPD point estimate). Here, we choose the slightly bigger size of 2 due to the special medical context, where the primary goal is predictive accuracy and sparsity is only a secondary goal.
The summary of the fold-wise solution paths presented in Table 1 shows that all K = 30 CV folds agree on the first two predictors: metastases and nodes (in this order). Thus, our selected submodel consists of these two predictors. After a final projection of the reference model onto this submodel (this time using the draw-by-draw method, i.e., projecting each posterior draw from the reference model onto the submodel parameter space without any clustering), we can make predictions with this submodel. These predictions are presented in Table 2 (this compact form is possible here because there are only two binary predictors).
Although the absolute changes in the predictive probabilities might at first seem quite large (up to about 22 % when changing only one predictor at a time, and up to about 27 % when changing both predictors simultaneously), the predictive probabilities are still dominated by the empirical frequencies of the response categories in the data and thus by the intercepts. As mentioned above, this was already observable in the reference model. Therefore, it is clear that this pattern is also visible here: Model selection cannot be expected to remove it.

Table 2: The predictions are probabilities for the categories of the response, the RCC subtype. The "rare" RCCs are the WHO-unclassified ones.

Discussion
We have presented how the projective part of the projection predictive variable selection can be performed in case of a discrete response family with finite support.This augmented-data projection has been implemented as an extension of the projpred R package.
Apart from the presentation of the methodology, the purpose of this paper was to compare the augmented-data projection to the latent projection, an alternative projection method that is far more general than the augmented-data projection and covers many discrete finite-support response families as well. The simulation study we have conducted to this end demonstrated that most of the time, the two projection methods behave quite similarly in terms of predictive performance and the submodel size found by the projpred::suggest_size() heuristic. In some cases, the augmented-data projection yields a better predictive performance and (although not necessarily in the same cases) a smaller suggested size than the latent projection.
In even less frequent cases, it is the latent projection which yields a better predictive performance and a smaller suggested size.
Overall (i.e., across all simulation iterations), the predictive performance of the submodels and the variable selection based upon it seem to be more stable in case of the augmented-data projection. This is probably due to the exact nature of the augmented-data projection, as opposed to the approximate nature of the latent projection. For example, in case of the ordinal family used here, one reason for the worse stability of the latent projection could be that it uses the reference model's draws of the threshold parameters to compute response-scale output (such as the response-scale MLPD) for a submodel: In general, the smaller the submodel size, the larger the lack of fit between the latent predictor of a submodel and the latent predictor of the reference model will be. When using such an ad-hoc solution for computing (response-scale) predictive probabilities by relying on the reference model's thresholds, a lack of fit in the latent predictor causes the predictive probabilities of a submodel to become suboptimal without the projection noticing this (and thus without the possibility for the projection to adjust the regression coefficients). In contrast, the augmented-data projection aims at reproducing directly the predictive probabilities of the reference model, adjusting both the regression coefficients and the thresholds of a submodel. In principle, the latent projection also allows calculating the predictive performance statistic(s) and other post-projection quantities on latent scale. By converting the results from the augmented-data projection to latent scale as well, we could have tried to compare the augmented-data and the latent projection on latent scale. However, in settings like ours where there is an independent test dataset (and the same applies to K-fold CV), it is not straightforward to define how the latent-scale predictions for the test dataset should be calculated (using the reference model fit based on the training data would induce a dependency between training and test data). Furthermore, latent-scale performance statistics like the latent-scale MLPD are not easily interpretable. Hence, we did not perform latent-scale analyses in our simulation study.
The MLPD was the only predictive performance statistic in our simulation study. In principle, the classification accuracy could be used as an alternative performance statistic in discrete finite-support observation models. However, especially in case of a moderate to large number of response categories (like the J = 5 categories in our simulation study), this comes with a loss of information that the MLPD does not exhibit: For example, if the true response category of an observation is category 3 (out of 5) and a model gives a predictive probability of 23 % for category 3, a predictive probability of 24 % for category 4, and predictive probabilities smaller than 23 % for all other categories, then the prediction of the highest-probability category would lead to a misclassification in the zero-one utility spirit of the classification accuracy. The MLPD is smoother in the sense that the log predictive probability of that observation is log(0.23), which would not differ much from the log predictive probability of log(0.24) in a situation where the predictive probabilities for categories 3 and 4 were reversed. In any case, even if the accuracy may be considered appropriate in some use cases (after all, the choice of performance statistic is an application-specific one), we do not expect our main conclusions to change significantly in case of alternative performance statistics.
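The numerical example above can be reproduced in a few lines; the probabilities for the remaining categories are made-up values chosen to sum to one:

```python
import math

true_cat = 3  # true response category (1-indexed)

# scenario A: category 4 has the highest predictive probability
p_a = [0.20, 0.18, 0.23, 0.24, 0.15]
# scenario B: the probabilities of categories 3 and 4 are reversed
p_b = [0.20, 0.18, 0.24, 0.23, 0.15]

# zero-one utility (classification accuracy): a hard jump from 0 to 1
acc_a = int(p_a.index(max(p_a)) + 1 == true_cat)  # 0 (misclassified)
acc_b = int(p_b.index(max(p_b)) + 1 == true_cat)  # 1

# log score: only a small change, log(0.23) vs. log(0.24)
lpd_a = math.log(p_a[true_cat - 1])
lpd_b = math.log(p_b[true_cat - 1])
```

The accuracy flips from 0 to 1 between the two scenarios, while the log predictive probability changes by only log(0.24) − log(0.23) ≈ 0.043, illustrating the smoothness of the log score.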
The cost of the augmented-data projection's higher stability is a considerable increase in runtime. Because of this, it might be helpful to use the latent projection for preliminary results in the model-building workflow and to use the augmented-data projection afterwards for final results. One particular purpose of a preliminary latent-projection run could be to find a reasonable value for argument nterms_max of projpred::varsel() or projpred::cv_varsel() (this argument determines up to which submodel size the search is conducted) because often, nterms_max can be chosen smaller than the value implied by the default heuristic, which reduces the runtime of the final augmented-data projection significantly.
An advantage of the augmented-data projection that was briefly mentioned in section 2.3 and later illustrated in the example from section 4 is the support for nominal families like brms::categorical(). So far, such families are not supported by the latent projection.
In the future (and if requested by users), the implementation of the augmented-data projection in projpred can be extended to more exotic discrete finite-support response families in a straightforward manner (see section 2.3).
Furthermore, the augmented-data projection might also be applicable to continuous response families and discrete families with infinite support, using either a Monte Carlo or a discretization approach to construct an artificial support that is discrete and finite. The Monte Carlo approach might require a clustering or some other kind of grouping of the response draws to arrive at a practicable number of response categories. For the discretization approach, it might be possible to borrow ideas from Röver and Friede (2017).
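As a purely illustrative sketch of the discretization idea (not an implementation from this paper; the simulated draws, the number of categories, and the quantile-based binning rule are all assumptions made for the example), one could bin posterior predictive draws into a finite set of categories like this:

```python
import bisect
import random
from collections import Counter

random.seed(1)

# Hypothetical posterior predictive draws for a continuous response; in the
# approaches sketched in the text, such draws would come from the reference
# model's posterior predictive distribution.
draws = [random.gauss(0.0, 1.0) for _ in range(10_000)]

# Discretization sketch: cut the draws at empirical quantiles so that the
# artificial support consists of a small, finite number of categories, which
# is what the augmented-data projection requires.
n_cats = 8
sorted_draws = sorted(draws)
edges = [sorted_draws[len(draws) * k // n_cats] for k in range(1, n_cats)]

# Map each draw to one of the n_cats artificial response categories; the
# quantile-based edges make the category counts roughly equal.
categories = [bisect.bisect_right(edges, x) for x in draws]
counts = Counter(categories)
```

How to choose n_cats, and how sensitive the projection would be to that choice, are exactly the kinds of open questions the future-work remark above alludes to.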
Finally, we note that the augmented-data projection in projpred also supports multilevel models. Since the projection predictive variable selection for multilevel models (in general) is currently subject to more detailed investigations, we leave the comparison of augmented-data and latent projection for multilevel models to future research.

Appendix A Detailed inspection of selected simulation iterations

What is interesting in Figures A.1 and A.2 is that two of the three selected iterations (31 and 75) are among those where the latent projection performs extraordinarily badly at small submodel sizes. Thus, a bad performance of the latent projection at small submodel sizes does not necessarily imply a bad performance overall. However, such a catch-up of the latent projection does not help if the augmented-data projection reaches a predictive performance close to the reference model at a submodel size smaller than the size where the catch-up takes place. This is the case in iteration 75, but not in iteration 31. In iteration 31, the augmented-data projection does not manage to reach a predictive performance close to the reference model (at least not up to the maximum submodel size of 19 implied here by the default of argument nterms_max of projpred::varsel()), leaving a gap in predictive performance that the latent projection does not exhibit. Iteration 18 does not exhibit a pronounced catch-up of the latent projection, but a striking spread of the two curves at larger submodel sizes. Apparently, the augmented-data projection overfits after having attained the maximum predictive performance at size 6. The latent projection overfits after this point as well, but the dip in predictive performance is shorter and shallower. For judging the consequences of this advantage of the latent projection, it is again important to consider the submodel size at which a sufficient predictive performance is reached (i.e., the submodel size which would typically be selected by the user): Here, the major advantage in predictive performance occurs after the point of sufficient predictive performance, so it is not that relevant (unlike the advantage from iteration 31). However, there is also a minor advantage in predictive performance at submodel sizes 1 to 3, which is indeed relevant because it takes place before the point of sufficient predictive performance of the augmented-data projection and even causes the projpred::suggest_size() heuristic to suggest a submodel size that is smaller by three predictor terms. As a side effect, Figure A.2 shows that for larger submodel sizes in iteration 31, the latent-projection SEs are smaller than their counterparts based on the augmented-data projection. This is a rather rare case (see Figure 3).

Appendix B Absolute-scale predictive performance
Figure 1 allowed us to compare the two projection methods' predictive performance relative to the reference model. For example, under the augmented-data projection, the submodel GMPD is always at least 50 % of the reference model GMPD, whereas under the latent projection, there are also several submodel GMPDs between ca. 25 % and 50 % of the reference model GMPD.
The aim of this section is to investigate whether the discrepancies between augmented-data and latent projection are also relevant on the absolute scale (i.e., not relative to the reference model). Unfortunately, Figure 1 cannot easily be modified to show the absolute scale of MLPD and GMPD: Because the reference model performance varies from simulation iteration to simulation iteration, a plot combining the lines from all simulation iterations would not contain a single dashed horizontal line for the reference model, but R = 100 of them. Thus, the only remedy is to inspect the results on the absolute scale separately for a few simulation iterations.
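For reference, the link between the MLPD and GMPD scales used throughout this section can be sketched as follows (all numbers are hypothetical; only the identity GMPD = exp(MLPD) is taken from the text):

```python
import math

# Hypothetical per-observation predictive probabilities of a submodel for
# n = 4 test observations (purely illustrative numbers).
probs = [0.30, 0.25, 0.40, 0.35]

# MLPD is the mean log predictive density, so exponentiating it yields the
# geometric mean predictive density (GMPD).
mlpd = sum(math.log(p) for p in probs) / len(probs)
gmpd = math.exp(mlpd)

# The same identity turns MLPD differences into GMPD ratios, which is why a
# figure's right y-axis can simply be the exp(.) scale of the left y-axis:
mlpd_ref = mlpd + 0.1                    # hypothetical reference-model MLPD
gmpd_ratio = math.exp(mlpd - mlpd_ref)   # equals GMPD / GMPD_ref
```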
To select a few iterations, we consider the difference GMPD lat − GMPD aug at size G min = min(G aug, G lat) (in the same fashion as for Figure 5) and choose those iterations where this suggested-size GMPD difference is either extremely small or extremely large (taking three iterations from each extreme).
Tables B.1 and B.2 show the corresponding results at size G min. From Table B.1, we can infer that the augmented-data projection achieves an additive suggested-size GMPD improvement (compared to the latent projection) of up to 6.9 %, with the three largest of these improvements all lying between 6 % and 7 %. Table B.2 shows that the latent projection achieves an additive suggested-size GMPD improvement of up to 6.3 % (similar to the maximum improvement achieved by the augmented-data projection), but its second and third largest improvements are considerably smaller than 6 %. In general, we would consider an additive GMPD improvement between 6 % and 7 % as relevant, remembering that a geometric mean gives the value that could be assigned to all factors of a product (here the joint predictive probability) to arrive at the same value of the product as when taking the original factors. The improvements of the augmented-data projection are persistent across all three iterations from the left column, whereas the latent projection leads to a clear advantage only in the most extreme iteration (the uppermost one) from the right column. This is iteration 31, which was already discussed in Appendix A.
Iteration 18 (the lowermost one from the right column) was already discussed in Appendix A as well. Interestingly, iteration 75 (the middle one from the left column) comes with the second largest additive suggested-size GMPD improvement of the augmented-data projection, but was also discussed in Appendix A, meaning that it also comes with one of the largest MLPD improvements of the latent projection, but only when considering all submodel sizes. This is possible due to the crossing of the two curves in iteration 75 (Figure B.1), with the augmented-data projection achieving a predictive performance close to the reference model earlier than the latent projection. An explanation for the crossing of the two curves might be that the inclusion of at least one predictor (probably two predictors, see Figure B.1) causes the projection to overfit, and that the two projection methods include this predictor at different submodel sizes.
We may conclude that even at the preferable suggested size G min, the discrepancy between augmented-data and latent projection can be relevant on the absolute scale, although it is not huge. Of course, the six simulation iterations were selected by cherry-picking extreme ones, but this was necessary to investigate how large the absolute-scale discrepancy at the preferable suggested size can get. To avoid a false impression, we repeat that most of the time, the predictive performance (also on the absolute scale) is similar between the two projection methods (see Figure 2).
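The geometric-mean interpretation invoked in the tables' discussion can be made concrete in a small numeric sketch (the probabilities are hypothetical; the 6.5 percentage points merely echo the magnitude of the improvements reported above):

```python
import math

# Hypothetical per-observation predictive probabilities of a submodel:
probs = [0.20, 0.35, 0.50, 0.30]
joint = math.prod(probs)            # joint predictive probability
gmpd = joint ** (1 / len(probs))    # geometric mean of the factors

# Assigning the geometric mean to every factor reproduces the same joint
# predictive probability as the original factors:
joint_from_gmpd = gmpd ** len(probs)

# An additive GMPD improvement of, say, 6.5 percentage points therefore
# multiplies the joint predictive probability by this factor:
joint_factor = ((gmpd + 0.065) / gmpd) ** len(probs)
```

This is why an additive GMPD improvement of a few percentage points, although small-looking, translates into a noticeably larger joint predictive probability.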

Figure 2: Predictive performance based on the latent projection minus predictive performance based on the augmented-data projection, for increasing submodel sizes and all R = 100 simulation iterations (represented by lines). The right y-axis is simply the exp(•) scale of the left y-axis. Note that for a given simulation iteration, the solution path can differ between the augmented-data and the latent projection.

Figure 6: Runtime (in minutes) of projpred::varsel() based on the augmented-data and the latent projection, for all R = 100 simulation iterations (represented by points and summarized by boxplots).

Figure 7: Relative predictive performance for increasing submodel sizes in the RCC example from section 4. Here, "relative" means that the left y-axis shows ∆MLPD = MLPD − MLPD* (with MLPD* denoting the reference model MLPD). The right y-axis is simply the exp(•) scale, i.e., it shows GMPD/GMPD* (with GMPD* denoting the reference model GMPD). The vertical uncertainty bars indicate SE(∆MLPD) (one such SE to either side of the point estimate).

Figure A.1: Predictive performance difference between augmented-data and latent projection in the three simulation iterations (31, 75, and 18; highlighted in green) with the largest values for MLPD lat − MLPD aug. Apart from the highlighting, this is the same as Figure 2.

Figure A.2: Relative predictive performance at increasing submodel sizes for the augmented-data projection (light red) and the latent projection (turquoise) in the three simulation iterations (represented by panels) with the largest values for MLPD lat − MLPD aug. The right y-axis is simply the exp(•) scale of the left y-axis. The vertical uncertainty bars indicate SE(∆MLPD).

Figure B.1 visualizes the absolute-scale predictive performance at all submodel sizes from the forward search (not only G min) for all of these most extreme simulation iterations. That visualization confirms the conclusions from Tables B.1 and B.2.
Figure B.1: Predictive performance at increasing submodel sizes for the augmented-data projection (light red) and the latent projection (turquoise) in those simulation iterations (represented by panels) where the "suggested-size GMPD difference" defined in the text is either extremely small (left three panels) or extremely large (right three panels). Both columns are sorted from most extreme (top) to least extreme (bottom). The right y-axis is simply the exp(•) scale of the left y-axis. The vertical uncertainty bars indicate SE(MLPD). The dashed horizontal lines indicate the predictive performance of the reference model.

Table 1: Summary of the fold-wise solution paths. For a given submodel size and a given predictor term from the full-data solution path, the table shows the proportion of CV folds that have this predictor term at this size in their solution path. Only submodel sizes 1 to 3 are shown because the search was terminated intentionally at size 3 (see text).

Table 2: All possible predictions from the final submodel.

Table B.1: Predictive performance at size G min = min(G aug, G lat) for the three simulation iterations with the smallest values for GMPD lat − GMPD aug at size G min. Rows are sorted from most extreme (top) to least extreme (bottom). Columns: Sim. iter., GMPD aug, GMPD lat, GMPD lat − GMPD aug.

Table B.2: Predictive performance at size G min = min(G aug, G lat) for the three simulation iterations with the largest values for GMPD lat − GMPD aug at size G min. Rows are sorted from most extreme (top) to least extreme (bottom). Columns: Sim. iter., GMPD aug, GMPD lat, GMPD lat − GMPD aug.